Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a blackbox and hence does not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above mentioned failure happens.

There’s some good information in here, so check it out.

Related Posts

Last-Click Attribution With Databricks Delta

Caryl Yuhas and Denny Lee give us an example of building a last-click digital marketing attribution model with Databricks Delta: The first thing we will need to do is to establish the impression and conversion data streams.   The impression data stream provides us a real-time view of the attributes associated with those customers who were served the […]

Read More

Working With Kafka At Scale

Tony Mancill has some tips for working with large-scale Kafka clusters: Unless you have architectural needs that require you to do otherwise, use random partitioning when writing to topics. When you’re operating at scale, uneven data rates among partitions can be difficult to manage. There are three main reasons for this: First, consumers of the “hot” […]

Read More


December 2017
« Nov Jan »