Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over that period of time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDD directly!

For me, it’s DataFrames all day, but Yaroslav has a more nuanced answer that is worth reading. There are also a couple of good examples.
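
To make the "sequence of RDDs" idea from the excerpt concrete, here is a minimal pure-Python sketch of discretization. It is not Spark code and uses no Spark API: the function names `discretize` and `word_counts_per_batch` are hypothetical, plain lists stand in for RDDs, and batches are cut by record count rather than by a time interval as Spark does.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def discretize(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a stream of records into a sequence of small batches.

    Each yielded list is a stand-in for one RDD in a DStream.
    Assumption: batches are cut by record count, not by time.
    """
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


def word_counts_per_batch(batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Run the same batch-style word count over every micro-batch."""
    for batch in batches:
        yield Counter(word for line in batch for word in line.split())


if __name__ == "__main__":
    lines = ["a b", "b c", "c d", "d e", "e"]
    for counts in word_counts_per_batch(discretize(lines, batch_size=2)):
        print(dict(counts))
```

Each batch is processed on its own with ordinary batch logic, which is why the model feels familiar coming from RDD workloads. Anything that spans batches, such as running totals, has to be managed separately in this sketch.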

