Structured Streaming With Spark

Tathagata Das, et al, discuss enterprise-grade streaming of structured data using Spark:

Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.

In the rest of post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts, by delivering gzipped JSON logs files to a S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON logs files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.

This introductory post discusses some of the architecture and setup, and they promise additional posts getting into finer details.

Performance Tuning A Streaming Application

Mathieu Dumoulin explains how he was able to get 10x performance out of a streaming application built around Kafka, Spark Streaming, and Apache Ignite:

The main issues for these applications were caused by trying to run a development system’s code, tested on AWS instances on a physical, on-premise cluster running on real data. The original developer was never given access to the production cluster or the real data.

Apache Ignite was a huge source of problems, principally because it is such a new project that nobody had any real experience with it and also because it is not a very mature project yet.

I found this article fascinating, particularly because the answer was a lot more than just “throw some more hardware at the problem.”

Streaming Data To Power BI

Sacha Tomey builds a quick Powershell script to feed streaming data into Power BI:

Peter showed off three mechanisms for streaming data to a real time dashboard:

  • The Power BI Rest API
  • Azure Stream Analytics
  • Streaming Datasets

We’ve done a fair bit at Adatis with the first two and whilst I was aware of the August 2016 feature, Streaming Datasets I’d never got round to looking at them in depth. Now, having seen them in action I wish I had – they are much quicker to set up than the other two options and require little to no development effort to get going – pretty good for demo scenarios or when you want to get something streaming pretty quickly at low cost.

Click through for more details and a sample script.

Spark Versus Flink

Sibanjan Das compares Apache Flink to Apache Spark:

The primitive concept of Apache Flink is the high-throughput and low-latency stream processing framework which also supports batch processing. The architecture is a flip of the other Big Data processing architectures where the primary notion was the batch processing framework. This is something that organizations have been looking for over the last decade. There is a need for platforms supporting low latency data movement for applications where even a millisecond delay can lead to severe consequences. The prospect of Apache Flink seems to be significant and looks like the goal for stream processing.

While comparing these two, don’t forget about Kafka Streams.  We’ve entered the streaming era for Hadoop & friends, and it’s an exciting time.

Streaming Data With Kinesis

Asaaf Mentzer shows how to join streaming data (specifically, AWS Kinesis) with lookup data:

In this use case, Amazon Kinesis Analytics can be used to define a reference data input on S3, and use S3 for enriching a streaming data source.

For example, bike share systems around the world can publish data files about available bikes and docks, at each station, in real time.  On bike-share system data feeds that follow the General Bikeshare Feed Specification (GBFS), there is a reference dataset that contains a static list of all stations, their capacities, and locations.

There are three different architectures in here, so if you’re looking for streaming data models with Kinesis (or want to apply them to Kafka), this is a solid read.

Hortonworks Data Flow 2.1

Wei Wang and Haimo Liu announce Hortonworks Data Flow version 2.1:

In the release of HDF 2.1, data flow administrators within the enterprise can identify that in order for certain potential processors to be added to a working data flow system, additional authorization would be required.

In addition, HDF 2.1 supports over 180 processors including newly introduced Connect/Listen/PutWebSocket, Put/FetchElasticsearch5, ValidateCsv, etc.

HDF is Hortonworks’s big play on simplifying streaming operations in Hadoop.

Log Aggregation With Kafka And Redis

Asaf Yigal has a two-part series on comparing Apache Kafka and Redis for moving log events into Elasticsearch.  Part 1 explains the technologies:

Redis is a bit different from Kafka in terms of its storage and various functionalities. At its core, Redis is an in-memory data store that can be used as a high-performance database, a cache, and a message broker. It is perfect for real-time data processing.

The various data structures supported by Redis are strings, hashes, lists, sets, and sorted sets. Redis also has various clients written in several languages which can be used to write custom programs for the insertion and retrieval of data. This is an advantage over Kafka since Kafka only has a Java client. The main similarity between the two is that they both provide a messaging service. But for the purpose of log aggregation, we can use Redis’ various data structures to do it more efficiently.

Part 2 compares the two technologies and explains which works better when:

Kafka heavily relies on the machine memory (RAM). As we see in the previous graph, utilizing the memory and storage is an optimal way to maintain a steady throughput. Its performance depends on the data consumption rate. In the case that consumers don’t consume data fast enough, Kafka will have to read from a disk and not from memory which will slow down its performance.

As you might expect, the answer for which technology to use is “it depends.”

Scaling Kinesis Streams

Allan MacInnis shows how to scale Amazon Kinesis streams using the UpdateShardCount API call:

You also need to adjust the alarm threshold to accommodate for the new shard capacity automatically. For this example, update the alarm threshold to 80% of your new capacity (or 3200 records per second) by setting a CloudWatch alarm with an action to publish to a SNS topic when the alarm is triggered.

You can then create a Lambda function that subscribes to this SNS topic and executes a call to the new UpdateShardCount API operation while adjusting the CloudWatch alarm threshold. To learn how to configure a Cloudwatch alarm, see Creating Amazon Cloudwatch Alarms. For information about how to invoke a Lambda function from SNS, see Invoking Lambda Functions Using Amazon SNS Notifications.

This is pretty cool.

Stream Computing Platform

Ravi Peri shows how to set up the Stream Computing Platform for .NET (SCP.Net) library and kick off a job:

SCP.Net generates a zip file consisting of the topology DLLs and dependency jars.

It uses Java (if found in the PATH) or .net to generate the zip. Unfortunately, zip files generated with .net are not compatible with Linux clusters.

If you’re interesting in working with a Storm topology while writing .NET code, check this out.

Kafka Enrichment

I have an article on enriching data stored in a Kafka topic:

We’re going a bunch of setup work here, so let’s take it from the top.  First, I declare a consumer group, which I’m calling “Airplane Enricher.”  Kafka uses the concept of consumer groups to allow consumers to work in parallel.  Imagine that we have ten separate servers available to process messages from the Flights topic.  Each flight message is independent, so it doesn’t matter which consumer gets it.  What does matter, though, is that multiple consumers don’t get the same message, as that’s a waste of resources and could lead to duplicate data processing, which would be bad.

The way Kafka works around this is to use consumer groups:  within a consumer group, only one consumer will get a particular message.  That way, I can have my ten servers processing messages “for real” and maybe have another consumer in a different consumer group just reading through the messages getting a count of how many records are in the topic.  Once you treat topics as logs rather than queues, consumer design changes significantly.

This is a fairly lengthy read, but directly business-applicable, so I think it’s well worth it.

Categories

March 2017
MTWTFSS
« Feb  
 12345
6789101112
13141516171819
20212223242526
2728293031