Spark Architecture: The Spark Streaming Receiver

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming:

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of  spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Related Posts

Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake: As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to […]

Read More

Cloudera Stream Processing

Dinesh Chandrasekhar announces the new iteration of Cloudera’s streaming data processor: Cloudera Stream Processing (CSP) is a product within the Cloudera DataFlow platform that packs Kafka along with some key streaming components that empower enterprises to handle some of the most complex and sophisticated streaming use cases. CSP provides advanced messaging, real-time processing and analytics on […]

Read More

Categories

April 2018
MTWTFSS
« Mar May »
 1
2345678
9101112131415
16171819202122
23242526272829
30