Spark Architecture: The Spark Streaming Receiver

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming:

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of  spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Related Posts

Building TensorFlow Neural Networks On Spark With Keras

Jules Damji has an example of using the PyCharm IDE to use Keras to build TensorFlow neural network models on the Databricks MLflow library: Our example in the video is a simple Keras network, modified from Keras Model Examples, that creates a simple multi-layer binary classification model with a couple of hidden and dropout layers and […]

Read More

Hortonworks Data Platform 3.0 Released

Saumitra Buragohain, et al, announce the newest version of the Hortonworks Data Platform: Highlighted Apache Hive features include: Workload management for LLAP:  You can assign resource pools within LLAP pool and allocate resources on a per user or per group basis. This enables support for large multi-tenant deployments. ACID v2 and ACID on by default:  We are […]

Read More


April 2018
« Mar May »