Kafka Plus Spark Streaming

Prasad Alle shows how to integrate Kafka with Spark Streaming on AWS:

Stream processing walkthrough

The entire pattern can be implemented in a few simple steps:

  1. Set up Kafka on AWS.

  2. Spin up an EMR 5.0 cluster with Hadoop, Hive, and Spark.

  3. Create a Kafka topic.

  4. Run the Spark Streaming app to process clickstream events.

  5. Use the Kafka producer app to publish clickstream events into the Kafka topic.

  6. Explore clickstream events data with SparkSQL.

This is a pretty easy-to-follow walkthrough with some good tips at the end.
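
To give a feel for what steps 4 and 5 move around, here is a minimal sketch in plain Python with no Kafka or Spark dependency. It is not the code from the walkthrough: the event fields (`user_id`, `page`, `ts`) and the helper names `make_event` and `count_pages` are hypothetical, chosen only for illustration. The producer side turns each click into JSON bytes, and the consumer side decodes a small batch of messages and counts hits per page, which is roughly the kind of aggregation a streaming job would do on each micro-batch.

```python
import json
from collections import Counter


def make_event(user_id, page, ts):
    """Serialize one clickstream event as UTF-8 JSON bytes.

    The schema (user_id, page, ts) is an assumption for this sketch,
    not the schema used in the original walkthrough.
    """
    return json.dumps({"user_id": user_id, "page": page, "ts": ts}).encode("utf-8")


def count_pages(batch):
    """Decode a batch of raw messages and count hits per page.

    Messages that are not valid JSON objects with a usable "page" field
    are skipped so that one bad message does not fail the whole batch.
    """
    counts = Counter()
    for raw in batch:
        try:
            event = json.loads(raw.decode("utf-8"))
            counts[event["page"]] += 1
        except (ValueError, KeyError, TypeError):
            continue
    return counts


if __name__ == "__main__":
    # Stand-in for messages read from a topic: three valid events, one bad one.
    batch = [
        make_event("u1", "/home", 1),
        make_event("u2", "/home", 2),
        make_event("u1", "/cart", 3),
        b"not json",
    ]
    for page, hits in sorted(count_pages(batch).items()):
        print(page, hits)
```

In a real deployment the bytes from `make_event` would be handed to a Kafka producer client, and the counting would run inside the Spark Streaming app rather than over an in-memory list.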
