Kafka + Spark Streaming

Kunal Khamar, et al, show how to integrate Apache Kafka with Spark’s structured streaming:

Kafka is a distributed pub-sub messaging system that is popular for ingesting real-time data streams and making them available to downstream consumers in a parallel and fault-tolerant manner. This renders Kafka suitable for building real-time streaming data pipelines that reliably move data between heterogeneous processing systems. Before we dive into the details of Structured Streaming’s Kafka support, let’s recap some basic concepts and terms.

Data in Kafka is organized into topics that are split into partitions for parallelism. Each partition is an ordered, immutable sequence of records, and can be thought of as a structured commit log. Producers append records to the tail of these logs and consumers read the logs at their own pace. Multiple consumers can subscribe to a topic and receive incoming records as they arrive. As new records arrive to a partition in a Kafka topic, they are assigned a sequential id number called the offset. A Kafka cluster retains all published records—whether or not they have been consumed—for a configurable retention period, after which they are marked for deletion.

Read the whole thing.

Related Posts

Spark Changes In HDP 2.6

Vinay Shukla and Syed Mahmood talk about what’s new with Spark and Zeppelin in the Hortonworks Data Platform 2.6 update: SPARKR & PYSPARK Most data scientists use R & Python and with SparkR & PySpark respectively they can continue to leverage their familiarity with the R & Python languages. However, they need to use the Spark […]

Read More

Presto On HDInsight

Ashish Thapliyal shows how to install Presto on an HDInsight cluster: What is Presto? Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto is becoming popular SQL interactive query engine that has grabbed the attention and mind-share […]

Read More

Categories

May 2017
MTWTFSS
« Apr  
1234567
891011121314
15161718192021
22232425262728
293031