Category: Streaming

Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over the years, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDDs directly!

For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.
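
To make the contrast concrete, here is a rough word-count sketch in both styles. This is my own illustration rather than code from Yaroslav’s post, and it assumes a Scala Spark application reading from a socket source on localhost:9999.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// DStream version: RDD-style transformations applied to each micro-batch
val ssc = new StreamingContext(new SparkConf().setAppName("dstream-wordcount"), Seconds(5))
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
  .print()
ssc.start()
ssc.awaitTermination()

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

// Structured Streaming version: the same logic expressed as a DataFrame query
val spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()
import spark.implicits._

val counts = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999").load()
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word").count()

counts.writeStream.outputMode("complete").format("console").start().awaitTermination()

The DataFrame version also picks up the Spark SQL optimizations mentioned in the excerpt, which is a big part of the argument for Structured Streaming.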

Lessons Learned From A Kafka Streams Implementation

Rishi Dhanaraj provides us with some lessons learned from implementing Kafka Streams to read data from Cassandra and Mongo and write into Mongo:

This Python script runs on a single machine and dates from the early days of the company. However, it doesn’t scale, since it cannot run in a distributed manner. As a result, the job ends up flapping: crashing and restarting regularly in production, depending on the load it needs to process.

Second, the Python script puts read pressure on MongoDB and Cassandra, because it has to query the databases for each batch of walk-ins and Zenreach Messages. MongoDB and Cassandra are our primary databases for serving customer read queries. So we wanted to remove the additional read pressure added by this job, which currently competes for resources with our customers.

For these reasons, we wanted to move to a streaming solution—specifically, Kafka Streams. We already switched to Kafka Streams for walk-in detection, which my teammate Eugen Feller explained in a previous post.

Click through for a review of the architecture and some tips if you want to do this yourself.

Stream Analytics And Power BI

Brad Llewellyn gives us a demo on connecting a Stream Analytics stream to Power BI for data analysis:

We understand that streaming data isn’t typically considered “Data Science” by itself. However, the two are often associated, and setting up this background now opens up some cool applications in later posts. For this post, we’ll cover how to sink streaming data to Power BI using Stream Analytics.

The previous posts in this series used Power BI Desktop for all of the showcases.  This post will be slightly different in that we will leverage the Power BI Service instead.  The Power BI Service is a collaborative web interface that has most of the same reporting capabilities as Power BI Desktop, but lacks the ability to model data at the time of writing.  However, we have heard whispers that data modeling capabilities may be coming to the service at some point.  The Power BI Service is also the standard method for sharing datasets, reports and dashboards across organizations.  For more information on the Power BI Service, read this.

Brad has a nice demo, so check it out.

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications:

When you create a stream processing application with Kafka’s Streams API, you create a Topology either using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running Kafka cluster and making sure to clean up state between tests adds a lot of complexity and time.

Instead, developers can unit test their Kafka Streams applications with utilities provided by kafka-streams-test-utils. Introduced in KIP-247, this artifact was specifically created to help developers test their code, and it can be added into your continuous integration and continuous delivery (CI/CD) pipeline.

Streaming applications need to be tested just like any other.
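
For a sense of what that looks like in practice, here is a hedged sketch of a unit test built on kafka-streams-test-utils, written in Scala against the Java API. The topology, topic names, and assertion are invented for illustration, and the TestInputTopic/TestOutputTopic helpers shown below arrived in a later Kafka release than the original KIP-247 utilities Yeva describes, so the exact calls may differ by version.

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{StreamsBuilder, StreamsConfig, TopologyTestDriver}

// A toy topology under test: upper-case every value from "input" into "output"
val builder = new StreamsBuilder()
builder.stream[String, String]("input")
  .mapValues(new ValueMapper[String, String] { def apply(v: String): String = v.toUpperCase })
  .to("output")

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-under-test")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234") // never contacted
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

// The driver pushes records through the topology synchronously; no broker required
val driver = new TopologyTestDriver(builder.build(), props)
val input  = driver.createInputTopic("input", Serdes.String().serializer(), Serdes.String().serializer())
val output = driver.createOutputTopic("output", Serdes.String().deserializer(), Serdes.String().deserializer())

input.pipeInput("k", "hello")
assert(output.readValue() == "HELLO")
driver.close()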

Generating Test Data In Kafka

Yeva Byzek takes us through the Kafka Connect Datagen connector:

Short of using real data from a real source, you do have a few options on how to generate more interesting test data for your topics. One option is to write your own client. Kafka has many programming language options—you choose: Java, Python, Go, .NET, Erlang, Rust—the list goes on. You can write your own Kafka client applications that produce any kind of records to a Kafka topic, and then you’re set.
But wouldn’t it be great if you could generate data locally to just fill topics with messages? Fortunately, you’re in luck, because we have those data generators.

Click through for a demonstration.
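
If you do take the “write your own client” route mentioned above, a throwaway generator is only a few lines. This is a generic Scala sketch with invented topic and field names; the Datagen connector itself is configured through Kafka Connect rather than code like this.

import java.util.Properties
import scala.util.Random
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
val pages = Seq("/home", "/search", "/cart", "/checkout")

// Fill a test topic with fake pageview events, keyed by a synthetic user id
for (_ <- 1 to 1000) {
  val user  = s"user-${Random.nextInt(50)}"
  val event = s"""{"user":"$user","page":"${pages(Random.nextInt(pages.length))}","ts":${System.currentTimeMillis()}}"""
  producer.send(new ProducerRecord[String, String]("pageviews-test", user, event))
}
producer.close()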

KSQL Deployment Options

Hojjat Jafarpour shows us two deployment options for Kafka Streams with KSQL:

As I mentioned, we have implemented KSQL on top of the Kafka Streams API. This means that every KSQL query is compiled into a Kafka Streams application. Therefore, KSQL queries follow the same execution model as Kafka Streams applications.
A query can be executed on multiple instances, and each instance will process a portion of the input data from the input topic(s) as well as generate a portion of the output data to the output topic(s). Based on this execution model, and depending on how we want to run our queries, KSQL currently provides two deployment options.

Read on for those options.

Comparing Streaming Engines

George Vetticaden compares Spark Streaming, Storm, and Kafka Streams:

Before the addition of Kafka Streams support, HDP and HDF supported two stream processing engines: Spark Structured Streaming and Streaming Analytics Manager (SAM) with Storm. So naturally, this begets the following question:
Why add a third stream processing engine to the platform?
With Spark Structured Streaming or SAM with Storm already available, customers could pick the right stream processing engine based on their non-functional requirements and use cases. However, neither of these engines addressed the following types of requirements that we saw from our customers:

And this doesn’t even include Samza or Flink, two other popular streaming engines.

My biased answer is, forget Storm.  If you have a legacy implementation of it, that’s fine, but I wouldn’t recommend new streaming implementations based off of it.  After that, you can compare the two competitors (as well as Samza and Flink) to see which fits your environment better.  I don’t think either of these has many scenarios where you completely regret going with, say, Kafka Streams instead of Spark Streaming.  Each has its advantages, but they’re not so radically different.

Using Databricks Delta In Lieu Of Lambda Architecture

Jose Mendes contrasts the Lambda architecture with the Databricks Delta architecture and gives us a quick example of using Databricks Delta:

The major problem with the Lambda architecture is that we have to build two separate pipelines, which can be very complex and, ultimately, make it difficult to combine the processing of batch and real-time data. However, it is now possible to overcome this limitation if we are willing to change our approach.
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming in a single system. The core abstraction of Databricks Delta is an optimized Spark table that stores data as parquet files in DBFS and maintains a transaction log that tracks changes to the table.

It’s an interesting contrast and I recommend reading the whole thing.
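
For a flavor of the single-pipeline approach Jose describes, here is a hedged Scala sketch, assuming a Databricks runtime with Delta available; the topic, paths, and checkpoint location are made up.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

// Land a Kafka stream directly in a Delta table backed by parquet files in DBFS
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/delta/events/_checkpoints")
  .start("/delta/events")

// The transaction log lets the same table serve batch queries...
val snapshot = spark.read.format("delta").load("/delta/events")

// ...and act as a source for downstream streaming jobs, so one pipeline covers both
val changes = spark.readStream.format("delta").load("/delta/events")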

Configuring Kafka Streams For Least Privilege

Gwen Shapira explains how we can assign minimal rights to Kafka Streams and KSQL users:

The principle of least privilege dictates that each user and application will have the minimal privileges required to do their job. When applied to Apache Kafka® and its Streams API, it usually means that each team and application will have read and write access only to a selected few relevant topics.

Organizations need to balance developer velocity and security, which means that each organization will likely have their own requirements and best practices for access control.

There are two simple patterns you can use to easily configure the right privileges for any Kafka Streams application—one provides tighter security, and the other is for more agile organizations. First, we’ll start with a bit of background on why configuring proper privileges for Kafka Streams applications was challenging in the past.

Read the whole thing; “granting everybody all rights” generally isn’t a good idea, no matter what your data platform of choice may be.
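
One way to express the tighter of the two patterns is with prefixed ACLs, since the internal topics and consumer group a Streams application creates are all prefixed with its application.id. The sketch below is mine, not Gwen’s, and the principal, topics, and application.id are invented; read her post for the authoritative guidance.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.common.acl.{AccessControlEntry, AclBinding, AclOperation, AclPermissionType}
import org.apache.kafka.common.resource.{PatternType, ResourcePattern, ResourceType}

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
val admin = AdminClient.create(props)

val principal = "User:orders-app"   // hypothetical service principal
val appId     = "orders-enricher"   // the Streams application.id

def allow(op: AclOperation, resType: ResourceType, name: String, pattern: PatternType): AclBinding =
  new AclBinding(
    new ResourcePattern(resType, name, pattern),
    new AccessControlEntry(principal, "*", op, AclPermissionType.ALLOW))

val acls = Seq(
  // read the input topic, write the output topic
  allow(AclOperation.READ,  ResourceType.TOPIC, "orders",          PatternType.LITERAL),
  allow(AclOperation.WRITE, ResourceType.TOPIC, "orders-enriched", PatternType.LITERAL),
  // full control over internal topics, which are prefixed with the application.id
  allow(AclOperation.ALL,   ResourceType.TOPIC, appId,             PatternType.PREFIXED),
  // the consumer group the app uses is also named after the application.id
  allow(AclOperation.READ,  ResourceType.GROUP, appId,             PatternType.PREFIXED)
)

admin.createAcls(acls.asJava).all().get()
admin.close()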

Apache Samza At 1.0

Jagadish Venkatraman announces Apache Samza 1.0:

We are pleased to announce today the release of Samza 1.0, a significant milestone in the history of the project. Apache Samza is a distributed stream processing framework that we developed at LinkedIn in 2013. Samza became a top-level Apache project in 2014. Fast-forward to 2018, and we currently have over 3,000 applications in production leveraging Samza at LinkedIn. The use-cases include detecting anomalies, combating fraud, monitoring performance, notifications, real-time analytics, and many more. Today, Samza integrates not only with Apache Kafka, but also with many other systems, including Azure EventHubs, Amazon Kinesis, HDFS, ElasticSearch, and Brooklin. Multiple companies like Slack, TripAdvisor, eBay, and Optimizely have adopted Samza.

We view Samza 1.0 as a step towards our vision of making stream processing universally accessible. In this post, we describe our journey in building and scaling a distributed stream processing system. We also present the key features in Samza 1.0: a rich high-level API, event-time-based processing, integration with Apache Beam, Samza SQL, a standalone mode to run Samza without YARN, and a new test framework for Samza applications.

This runs in the same space as Spark Streaming, Flink, and Kafka Streams, so there are plenty of competitors and a lot of innovation.
