Press "Enter" to skip to content

Late-Arriving Data with Spark Streaming

Sarfaraz Hussain continues a series on Spark streaming:

The size of the State (discussed in the previous blog) will continue to increase indefinitely as we really don’t know when a bucket can be closed.

But practically a query is not going to receive data 1 week late or in that matter such late-arriving data is of no use to us.

So, to specify the information when to stop considering older buckets for the streaming query we use Watermark.

Read on to see how you can design and implement a watermark.