Kafka Streams And Time-Based Batching

Vladimir Vajda provides a warning for people using Kafka Streams:

To completely understand the problem, we will first go into detail how ingestion and processing occur by default in Kafka Streams. For example purposes, the punctuate method is configured to occur every ten seconds, and in the input stream, we have exactly one message per second. The purpose of the job is to parse input messages, collect them, and, in the punctuate method, do a batch insert in the database, then to send metrics.

After running the Kafka Stream application, the Processor will be created, followed by the initmethod. Here is where all the connections are established. Upon successful start, the application will listen to input topic for incoming messages. It will remain idle until the first message arrives. When the first message arrives, the process method is called — this is where transformations occur and where the result is stored for later use. If no messages are in the input topic, the application will go idle again, waiting for the next message. After each successful process, the application checks if punctuate should be called. In our case, we will have ten process calls followed by one punctuate call, with this cycle repeating indefinitely as long as there are messages.

A pretty obvious behavior, isn’t it? Then why is one bolded?

Read on for more, including how to handle this edge case.

Related Posts

Last-Click Attribution With Databricks Delta

Caryl Yuhas and Denny Lee give us an example of building a last-click digital marketing attribution model with Databricks Delta: The first thing we will need to do is to establish the impression and conversion data streams.   The impression data stream provides us a real-time view of the attributes associated with those customers who were served the […]

Read More

Working With Kafka At Scale

Tony Mancill has some tips for working with large-scale Kafka clusters: Unless you have architectural needs that require you to do otherwise, use random partitioning when writing to topics. When you’re operating at scale, uneven data rates among partitions can be difficult to manage. There are three main reasons for this: First, consumers of the “hot” […]

Read More

Categories

December 2017
MTWTFSS
« Nov Jan »
 123
45678910
11121314151617
18192021222324
25262728293031