Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.

Related Posts

Pipeline Architecture With Kafka

Alexandra Wang describes how Pandora Media has used Apache Kafka for real-time ad serving using Kafka Connect: Our ad server publishes billions of messages per day to Kafka. We soon realized that writing a proprietary Kafka consumer able to handle that amount of data with the desired offset management logic would be non-trivial, especially when requiring […]

Read More

Text Normalization With Spark

Engineers at Treselle Systems have put together a two-part series on text normalization using Apache Spark.  First, they walk through normalizing the text: We have used Spark shared variable “broadcast” to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached in executors. “stopwords_en.txt” is not a large dataset but we have […]

Read More

Categories

March 2017
MTWTFSS
« Feb Apr »
 12345
6789101112
13141516171819
20212223242526
2728293031