Hard Problems In Stream Processing

Kartik Paramasivam discusses tough issues within the Lambda architecture:

During a data center failover like the exampleabove, we could have a “late arrival,” i.e. the stream processor might see the AdClickEvent possibly a few minutes after the AdViewEvent. A poorly written stream processor might deduce that the ad was a low-quality ad when instead the ad might have actually been good. Another anomaly is that the stream processor might see the AdClickEvent before it sees the corresponding AdViewEvent. To ensure that the output of the stream processor is correct there has to be logic to handle this “out of order message arrival.”

In the example above, the geo-distributed nature of the data centers makes it easy to explain the delays. However delays can exist even within the same data center due to GC issues, Kafka cluster upgrades, partition rebalances, and other naturally occurring distributed system phenomena.

This is a pretty long article and absolutely worth a read if you are looking at streaming data.

Related Posts

Working With Skewed Data In Pig

Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data: The problem is that there are 3,000 map tasks are launched to read the daily data and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s […]

Read More

Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming: Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot […]

Read More


June 2016
« May Jul »