Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.

Related Posts

Tips For Using PolyBase With Cloudera QuickStart VM

I have a post on using Cloudera’s QuickStart VM with PolyBase: Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini. Because PolyBase needs to have direct access to the data nodes, […]

Read More

Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates: This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds […]

Read More

Categories

March 2017
MTWTFSS
« Feb Apr »
 12345
6789101112
13141516171819
20212223242526
2728293031