Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over that period of time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDD directly!

For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.

Related Posts

MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class: Let me tell you about one of my least favorite things I like to see in PolyBase: Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster This error is not limited to PolyBase but is instead […]

Read More

Using the StreamSets Snowflake Destination

Dash Desai shows how you can use StreamSets to write data into SnowflakeDB: In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically. To illustrate, let’s take HTTP web server logs […]

Read More

Categories

February 2019
MTWTFSS
« Jan Mar »
 123
45678910
11121314151617
18192021222324
25262728