Stateful Processing In Spark Streaming

Bill Chambers and Jules Damji look at a couple of stateful scenarios within Spark Streaming:

No streaming events are free of duplicate entries. Dropping duplicate entries in record-at-a-time systems is imperative—and often a cumbersome operation for a couple of reasons. First, you’ll have to process small or large batches of records at time to discard them. Second, some events, because of network high latencies, may arrive out-of-order or late, which may force you to reiterate or repeat the process. How do you account for that?

Structured Streaming, which ensures exactly once-semantics, can drop duplicate messages as they come in based on arbitrary keys. To deduplicate data, Spark will maintain a number of user-specified keys and ensure that duplicates, when encountered, are discarded.

Just as other stateful processing APIs in Structured Streaming are bounded by declaring watermarking for late data semantics, so is dropping duplicates. Without watermarking, the maintained state can grow infinitely over the course of your stream.

In this scenario, you would still want some sort of de-duplication code at the far end of your process if you can never have duplicates come in across the lifetime of the application.  This sounds like it’s more about preventing bursty duplicates from sensors.

Related Posts

Security Improvements In Kafka And Confluent Platform

Vahid Fereydouny demonstrates a number of security improvements made to Apache Kafka 2.0 as well as Confluent Platform 5.0: Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature […]

Read More

SparkSession Versus SparkContext

Abhishek Baranwal explains the differences between the SparkSession object and the SparkContext object when writing Spark code: Prior to spark 2.0, SparkContext was used as a channel to access all spark functionality. The spark driver program uses sparkContext to connect to the cluster through resource manager. SparkConf is required to create the spark context object, […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031