Structured Streaming With Spark

Tathagata Das, et al, discuss enterprise-grade streaming of structured data using Spark:

Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.

In the rest of post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts, by delivering gzipped JSON logs files to a S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON logs files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.

This introductory post discusses some of the architecture and setup, and they promise additional posts getting into finer details.

Related Posts

Temporal Tables with Flink

Marta Paes shows off a new feature in Apache Flink: In the 1.7 release, Flink has introduced the concept of temporal tables into its streaming SQL and Table API: parameterized views on append-only tables — or, any table that only allows records to be inserted, never updated or deleted — that are interpreted as a changelog and […]

Read More

Optimizing Kafka Streams Apps

Bill Bejeck and Guozhang Wang give us an idea of some Kafka Streams internals: At a high level, when you use the Streams DSL, it auto-creates the processor nodes as well as state stores if needed, and connects them to construct the processor topology. To dig a little deeper, let’s take an example and focus […]

Read More


January 2017
« Dec Feb »