Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a blackbox and hence does not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above mentioned failure happens.

There’s some good information in here, so check it out.

Related Posts

Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive: If we are using earlier Spark versions, we have to use HiveContext which is variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support. In this […]

Read More

Stream Reactor Update

Andrew Stevenson announces Stream Reactor 1.0.0 for Kafka Connect 1.0: Stream Reactor is an Apache License, Version 2.0 open source collection of components built on top of Kafka and provides Kafka Connect compatible connectors to move data between Kafka and popular data stores. Stream Reactor provides source connectors to publish data into Kafka and sink connectorsto bring data from Kafka […]

Read More

Categories

December 2017
MTWTFSS
« Nov Jan »
 123
45678910
11121314151617
18192021222324
25262728293031