Press "Enter" to skip to content

Using Kafka To Go From Batch To Stream

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka.  Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

  • When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.

  • Every 24 hours, a batch job (could be Spark) takes all the new reviews and applies a spam filter to separate fraudulent reviews from legitimate ones.

  • New valid reviews are published to another database table (which contains all the historic valid reviews).

  • Another batch job or a SQL query computes new stats for courses. Stats include the all-time average rating, all-time count of reviews, 90-day average rating, and 90-day count of reviews.

  • The website displays these metrics through a REST API when a user navigates the website.
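
A natural first step in streaming this pipeline is to publish each review to a Kafka topic as it arrives, rather than only writing it to a database table. As a rough illustration, here is a minimal Java producer sketch; the udemy-reviews topic name, the course-id key, and the JSON payload shape are my own assumptions, not code from the series:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReviewProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by course id so all reviews for a course land in one partition,
                // which keeps later per-course aggregations simple. Both values are
                // hypothetical placeholders.
                String courseId = "some-course-id";
                String reviewJson = "{\"rating\": 5, \"comment\": \"Great course!\"}";
                producer.send(new ProducerRecord<>("udemy-reviews", courseId, reviewJson));
            }
        }
    }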

Part two finishes up the story:

In the previous section, we learned the early concepts of Kafka Streams and used them to take a stream and split it in two based on a spam evaluation function. Now we need to perform stateful computations, such as aggregations and windowing, in order to compute statistics on our stream of reviews.
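
As a refresher on that splitting step, here is a hedged sketch of branching a review stream with the Kafka Streams DSL's branch() operator (superseded by split() in newer releases). The topic names and the isSpam heuristic are placeholders, not the series' actual evaluation function:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;

    public class SpamBranchSketch {
        // Placeholder heuristic; the series' real spam evaluation is not shown here.
        static boolean isSpam(String reviewJson) {
            return reviewJson.contains("http://");
        }

        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Assumed topic of JSON reviews keyed by course id; relies on the
            // application's default String serdes.
            KStream<String, String> reviews = builder.stream("udemy-reviews");

            // branch() routes each record to the stream of the first matching predicate.
            KStream<String, String>[] branches = reviews.branch(
                (courseId, review) -> !isSpam(review),  // index 0: legitimate reviews
                (courseId, review) -> isSpam(review)    // index 1: fraudulent reviews
            );

            branches[0].to("udemy-reviews-valid");
            branches[1].to("udemy-reviews-fraud");
        }
    }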

Thankfully, we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that receives a new event every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.
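
To make the KStream-to-KTable idea concrete, here is a sketch of the count-style statistics from the list above; the averages would follow the same pattern with aggregate() instead of count(). Topic names are assumptions, and the Duration-based windowing API assumes Kafka Streams 2.1 or later:

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    public class ReviewStatsSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Assumed topic of valid reviews, keyed by course id.
            KStream<String, String> validReviews = builder.stream(
                "udemy-reviews-valid", Consumed.with(Serdes.String(), Serdes.String()));

            // All-time count of reviews per course: the KStream becomes a KTable
            // whose row for a course is updated on every new review.
            KTable<String, Long> allTimeCounts = validReviews
                .groupByKey()
                .count();

            // Rough analogue of the 90-day statistics: the same count, scoped to
            // tumbling 90-day windows (a true rolling 90-day lookback needs more work).
            KTable<Windowed<String>, Long> ninetyDayCounts = validReviews
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofDays(90)))
                .count();

            // Every KTable update is forwarded downstream as a changelog record.
            allTimeCounts.toStream()
                .to("course-review-counts", Produced.with(Serdes.String(), Serdes.Long()));
        }
    }

Each incoming review updates the corresponding KTable row, and every update flows downstream, which is the "forwarded downstream" behavior described in the quote above.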

This is a nice introduction to Kafka Streams using a realistic example.