Understanding A Spark Streaming Workflow

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode“. It is nothing but one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, so, let’s read about them in detail:

  1. Complete Mode – This mode updates the entire Result Table which is eventually written to the sink.

  2. Append Mode – In this mode, only the new rows are appended in the Result Table and eventually sent to the sink.

  3. Update Mode – At last, this mode updates only the rows that are changed in the Result Table since the last trigger. Also, only the new rows are sent to the sink. There is one peculiar thing to note about this mode, i.e., it is different from the Complete Mode in the way that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain any aggregations, it is equivalent to the Append mode.

Check it out.

Related Posts

Building TensorFlow Neural Networks On Spark With Keras

Jules Damji has an example of using the PyCharm IDE to use Keras to build TensorFlow neural network models on the Databricks MLflow library: Our example in the video is a simple Keras network, modified from Keras Model Examples, that creates a simple multi-layer binary classification model with a couple of hidden and dropout layers and […]

Read More

Hortonworks Data Platform 3.0 Released

Saumitra Buragohain, et al, announce the newest version of the Hortonworks Data Platform: Highlighted Apache Hive features include: Workload management for LLAP:  You can assign resource pools within LLAP pool and allocate resources on a per user or per group basis. This enables support for large multi-tenant deployments. ACID v2 and ACID on by default:  We are […]

Read More

Categories

June 2018
MTWTFSS
« May Jul »
 123
45678910
11121314151617
18192021222324
252627282930