Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.

Related Posts

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark: There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and […]

Read More

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow: Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info. Databricks Runtime 5.3 and 5.3 ML and above support […]

Read More

Categories

March 2017
MTWTFSS
« Feb Apr »
 12345
6789101112
13141516171819
20212223242526
2728293031