Press "Enter" to skip to content

Trying out Spark Streaming

Tomaz Kastrun continues a series on Spark. Part 15 provides an introduction to Spark Streaming:

Spark Streaming or Structured Streaming is a scalable and fault-tolerant, end-to-end stream processing engine. it is built on the Spark SQL engine. Spark SQL engine will is responsible for running results sets for streaming data, regardless of static or continuously in coming stream data.

Spark stream can use Dataframe (or Datasets) API in Scala, Python, R or Java to work on handling data ingest, creating streaming analytics and do all the computations. All these requests and workloads are done against Spark SQL engine.

I don’t think I’ve ever seen an example of using Spark Streaming in R, so that one’s new to me.

Part 16 looks at DataFrame operations in Spark Streaming:

When working with Spark Streaming from file based ingestion, user must predefine the schema. This will require not only better performance but consistent data ingest for streaming data. There is always possibility to set the spark.sql.streaming.schemaInference to true to enable Spark to infer schema on read or automatically.

Check out both of those posts.