Sarfaraz Hussain has started a series on Spark Streaming. The first post gives an introduction to the topic:
The philosophy behind the development of Structured Streaming is that,
“We as end user should not have to reason about streaming”.
What that means is that we, as end users, should only write batch-like queries, and it's Spark's job to figure out how to run them on a continuous stream of data and continuously update the result as new data flows in.
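To make the "batch-like query" idea concrete, here is a minimal sketch (mine, not from the post): the aggregation below is ordinary DataFrame code, and only the readStream/writeStream calls mark it as streaming. The rate source and console sink are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("BatchLikeStreaming").getOrCreate()
import spark.implicits._

// Streaming input: the built-in "rate" source emits (timestamp, value) rows.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// The query itself is written exactly as it would be on static data.
val counts = stream.groupBy(window($"timestamp", "1 minute")).count()

// Spark keeps this result up to date as new rows arrive.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```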
Sarfaraz then follows this up with a bit on the structure of a streaming query:
So, as new data comes in, Spark breaks it into micro-batches (based on the processing trigger), processes each batch, and writes the results out to the Parquet file.
It is Spark's job to figure out whether the query we have written is executed on batch data or streaming data. Since, in this case, we are reading data from a Kafka topic, Spark will automatically figure out how to run the query incrementally on the streaming data.
Check them both out.