Category: Streaming

Stateful Processing In Spark Streaming

Bill Chambers and Jules Damji look at a couple of stateful scenarios within Spark Streaming:

No streaming events are free of duplicate entries. Dropping duplicate entries in record-at-a-time systems is imperative, and often a cumbersome operation, for a couple of reasons. First, you’ll have to process small or large batches of records at a time to discard them. Second, some events may arrive out-of-order or late because of high network latencies, which may force you to reiterate or repeat the process. How do you account for that?

Structured Streaming, which ensures exactly-once semantics, can drop duplicate messages as they come in based on arbitrary keys. To deduplicate data, Spark will maintain a number of user-specified keys and ensure that duplicates, when encountered, are discarded.

Just as other stateful processing APIs in Structured Streaming are bounded by declaring watermarking for late data semantics, so is dropping duplicates. Without watermarking, the maintained state can grow infinitely over the course of your stream.

If you can never allow duplicates to come in across the lifetime of the application, you would still want some sort of de-duplication code at the far end of your process; this feature sounds like it’s more about preventing bursty duplicates from sensors.
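
For reference, de-duplication with a watermark is only a couple of method calls. This is a minimal spark-shell-style sketch, not code from the post; the JSON source path, schema, and the guid/eventTime column names are all placeholders.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

  // Hypothetical stream of events keyed by a guid column, with an event-time column.
  val events = spark.readStream
    .format("json")
    .schema("guid STRING, eventTime TIMESTAMP, payload STRING")
    .load("/tmp/events")

  // The watermark bounds the de-duplication state: keys falling more than 10 minutes
  // behind the maximum observed event time are dropped from the state store.
  val deduped = events
    .withWatermark("eventTime", "10 minutes")
    .dropDuplicates("guid", "eventTime")

  deduped.writeStream
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()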

Optimizing Apache Flink

Ivan Mushketyk has a few tips for speeding up programs using Apache Flink:

One more way to optimize your Flink application is to provide some information about what your user-defined functions are doing with input data. Since Flink can’t parse and understand code, you can provide crucial information that will help to build a more efficient execution plan. There are three annotations that we can use:

  1. @ForwardedFields: Specifies what fields in an input value were left unchanged and are used in an output value.

  2. @NotForwardedFields: Specifies fields that were not preserved in the same positions in the output.

  3. @ReadFields: Specifies what fields were used to compute a result value. You should only specify fields that were used in computations and not merely copied to the output.
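
To make the annotations a little more concrete, here is a small sketch of how they attach to a user-defined function in the DataSet API. It is my own illustration rather than code from Ivan's post, and the tuple layout (userId, price, quantity) is invented for the example.

  import org.apache.flink.api.common.functions.MapFunction
  import org.apache.flink.api.java.functions.FunctionAnnotation.{ForwardedFields, ReadFields}

  // Input: (userId, price, quantity); output: (userId, total).
  @ForwardedFields(Array("_1"))   // userId is copied to the output unchanged, in the same position
  @ReadFields(Array("_2; _3"))    // only price and quantity are actually read to compute the result
  class TotalSpend extends MapFunction[(Int, Double, Int), (Int, Double)] {
    override def map(in: (Int, Double, Int)): (Int, Double) =
      (in._1, in._2 * in._3)
  }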

Click through for his four tips.

Anomaly Detection With Kafka Streams

Ajmal Karuthakantakath shows us an application which performs fairly simple anomaly detection using Kafka Streams:

The problem is in the banking loan payment domain, where customers have taken a loan and they need to make monthly payments to repay the loan amount.

Assume there are millions of customers in the system and all these customers need to make monthly payments to their account. Each customer may have a different monthly due date depending on their monthly loan due date.

Each customer payment will appear as a PaymentScheduleEvent event. Customers can make more than one PaymentScheduleEvent per month. Each monthly due date for a customer will appear as a PaymentDueEvent.

An arbitrarily chosen anomaly condition for this example is that if the amount due is more than $150 for any customer at any point in time, this generates an anomaly.
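
To give a feel for the shape of such an application, here is a rough Kafka Streams sketch in Scala. It is not Ajmal's implementation: the topic names are invented, the events are boiled down to per-customer dollar amounts, and the anomaly is simply a running balance that crosses the $150 threshold.

  import java.util.Properties
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.serialization.Serdes._
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

  object AnomalySketch extends App {
    val builder = new StreamsBuilder()

    // Hypothetical topics keyed by customer id: amounts falling due and payments made.
    val due  = builder.stream[String, Double]("payment-due")
    val paid = builder.stream[String, Double]("payment-made").mapValues(v => -v)

    // Keep a running outstanding balance per customer.
    val balance = due.merge(paid)
      .groupByKey
      .reduce(_ + _)

    // Emit an anomaly record whenever the balance exceeds the arbitrary $150 threshold.
    balance.toStream
      .filter((_, amountDue) => amountDue > 150.0)
      .to("payment-anomalies")

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "anomaly-sketch")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    new KafkaStreams(builder.build(), props).start()
  }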

Click through for instructions, the application, and further resources.  If you want to learn Kafka Streams, this should keep you busy for a little while.

Crossing The Streams With Kafka

Himani Arora shows how to join two Kafka streams together:

KStream-KStream Join

It is a sliding window join, meaning that all tuples close to each other with regard to time are joined, where "close" means the time difference is at most the size of the window.

These joins are always windowed joins because otherwise, the size of the internal state store used to perform the join would grow indefinitely.

In the following example, we perform an inner join between two streams. The output of the joined stream will be of type KStream<K, ...>.
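
As a rough illustration (not Himani's code), a windowed inner join in the Kafka Streams Scala DSL looks something like this; the topic names and the five-minute window are placeholders.

  import java.time.Duration
  import java.util.Properties
  import org.apache.kafka.streams.kstream.JoinWindows
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.serialization.Serdes._
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

  object WindowedJoinSketch extends App {
    val builder = new StreamsBuilder()

    val clicks      = builder.stream[String, String]("clicks")
    val impressions = builder.stream[String, String]("impressions")

    // Inner join: records from the two streams pair up only when their timestamps
    // fall within five minutes of each other (the sliding window).
    clicks.join(impressions)(
      (click, impression) => s"$click / $impression",
      JoinWindows.of(Duration.ofMinutes(5))
    ).to("clicks-with-impressions")

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-join-sketch")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    new KafkaStreams(builder.build(), props).start()
  }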

Read on to learn more about two additional join types.

Benchmarking Streaming Systems

Burak Yavuz shares a benchmark of Spark Streaming versus Flink and Kafka Streams:

At Databricks, we used Databricks Notebooks and cluster management to set up a reproducible benchmarking harness that compares the performance of Apache Spark’s Structured Streaming, running on Databricks Unified Analytics Platform, against other open source streaming systems such as Apache Kafka Streams and Apache Flink. In particular, we used the following systems and versions in our benchmarks:

The Yahoo Streaming Benchmark is a well-known benchmark used in industry to evaluate streaming systems. When setting up our benchmark, we wanted to push each streaming system to its absolute limits, yet keep the business logic the same as in the Yahoo Streaming Benchmark. We shared some of the results we achieved from these benchmarks during the Spark Summit West 2017 keynote, showing that Spark can reach 5x or higher throughput over other popular streaming systems. In this blog, we discuss in more detail how we performed this benchmark, and how you can reproduce the results yourselves.

Standard vendor-based metric warnings aside, you can get the benchmark details at their GitHub repo.

Predicting Advertising Budgets With Kafka Streams

Boyang Chen explains how Pinterest uses Kafka Streams to reduce advertising overdelivery:

Overdelivery occurs when free ads are shown for out-of-budget advertisers. This reduces opportunities for advertisers with available budget to have their products and services discovered by potential customers.

Overdelivery is a difficult problem to solve for two reasons:

  1. Real-time spend data: Information about ad impressions needs to be fed back into the system within seconds in order to shut down out-of-budget campaigns.

  2. Predictive spend: Fast, historical spend data isn’t enough. The system needs to be able to predict spend that might occur in the future and slow down campaigns close to reaching their budget. That’s because an inserted ad could remain available to be acted on by a user. This makes the spend information difficult to accurately measure in a short timeframe. Such a natural delay is inevitable, and the only thing we can be sure of is the ad insertion event.

This is a very interesting architectural overview.

Using Akka In A Streaming Solution

Artem Rukavytsia shows us how you can easily integrate Akka into a solution with Kafka and Spark Streaming:

Akka gives you the opportunity to make logic for producing/consuming messages from Kafka with the Actor model. It’s very convenient if actors are widely used in your code and it significantly simplifies making data pipelines with actors. For example, you have your Akka Cluster, one part of which allows you to crawl web pages and the other part of which makes it possible to index and send indexed data to Kafka. The consumer can aggregate this logic. Producing data to Kafka looks as follows:

The Actor model, which Akka implements, is something I kind of understand, but have never spent much time trying to implement.  I can see how it’d make perfect sense communicating with Kafka, though, given the scale and independence of consumers within a consumer group that Kafka provides.
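
The code sample referenced above doesn't carry over into the excerpt, so as a stand-in, here is a hedged sketch of the general idea: a classic Akka actor wrapping a plain KafkaProducer and forwarding whatever messages it receives to a topic. None of the names below come from Artem's article.

  import java.util.Properties
  import akka.actor.{Actor, ActorSystem, Props}
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
  import org.apache.kafka.common.serialization.StringSerializer

  // An actor that publishes every String message it receives to a Kafka topic.
  class KafkaForwarder(topic: String) extends Actor {
    private val config = new Properties()
    config.put("bootstrap.servers", "localhost:9092")
    config.put("key.serializer", classOf[StringSerializer].getName)
    config.put("value.serializer", classOf[StringSerializer].getName)

    private val producer = new KafkaProducer[String, String](config)

    override def receive: Receive = {
      case msg: String =>
        // Fire-and-forget; a real pipeline would handle the callback or returned Future.
        producer.send(new ProducerRecord[String, String](topic, msg))
    }

    override def postStop(): Unit = producer.close()
  }

  object ForwarderDemo extends App {
    val system    = ActorSystem("crawler")
    val forwarder = system.actorOf(Props(new KafkaForwarder("indexed-pages")), "kafka-forwarder")
    forwarder ! "hello from an actor"
  }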

Using Spark Streaming On Kafka

Ayush Tiwari has an introductory tutorial on using Spark Streaming on top of Kafka:

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change.

In this blog, I am going to implement a basic example of Spark Structured Streaming and Kafka integration.

This is a code-heavy tutorial, so check it out.
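
If you just want the flavor before clicking through, reading Kafka from Structured Streaming looks roughly like this (spark-shell style; the broker address and topic name are placeholders, and this is not Ayush's exact code).

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

  val messages = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-topic")
    .option("startingOffsets", "earliest")
    .load()

  // Kafka records arrive as binary key/value columns plus metadata (topic, partition, offset).
  val decoded = messages.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  decoded.writeStream
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()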

Joins With Kafka

Florian Trossbach and Matthias J Sax show the various sorts of joins offered in Kafka, both streams and tables:

Apache Kafka’s Streams API provides a very sophisticated API for joins that can handle many use cases in a scalable way. However, some join semantics might be surprising to developers as streaming join semantics differ from SQL semantics. Furthermore, the semantics of changelog streams and tombstone messages (that are used for deletes) are a new concept in stream processing.

Kafka’s journey from Pub/Sub broker to distributed streaming platform is well underway, and our times as engineers are very exciting!

I didn’t know you could join streams together in Kafka, so that’s really cool.
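
For a taste of what a stream-to-table join looks like, here is a hedged sketch in the Kafka Streams Scala DSL. It is a toy example of my own rather than anything from the article, with invented topic names; a null value (tombstone) on the table's topic removes that key, after which stream records for that key no longer find a match.

  import java.util.Properties
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.serialization.Serdes._
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

  object StreamTableJoinSketch extends App {
    val builder = new StreamsBuilder()

    val purchases = builder.stream[String, String]("purchases")
    val profiles  = builder.table[String, String]("customer-profiles")

    // Inner KStream-KTable join: each purchase looks up the current profile for its key;
    // purchases whose key has no (or a tombstoned) profile entry are dropped.
    purchases
      .join(profiles)((purchase, profile) => s"$purchase (customer: $profile)")
      .to("enriched-purchases")

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-join-sketch")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    new KafkaStreams(builder.build(), props).start()
  }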

Creating A Simple Kafka Streams Application

Bill Bejeck has built a simple Kafka Streams application for us:

This blog post will quickly get you off the ground and show you how Kafka Streams works. We’re going to make a toy application that takes incoming messages and upper-cases the text of those messages, effectively yelling at anyone who reads the message. This application is called the yelling application.

Before diving into the code, let’s take a look at the processing topology you’ll assemble for this “yelling” application. We’ll build a processing graph topology, where each node in the graph has a particular function.

His entire application is 20 lines of code but it does function as a valid Kafka Streams app and works well as a demo.
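
Purely for flavor, here is roughly the same idea sketched in the Kafka Streams Scala DSL; it is not his code, and the topic names are placeholders.

  import java.util.Properties
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.serialization.Serdes._
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

  object YellingApp extends App {
    val builder = new StreamsBuilder()

    // Read text messages and shout them back out in upper case.
    builder.stream[String, String]("src-topic")
      .mapValues(_.toUpperCase)
      .to("out-topic")

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "yelling-app-sketch")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }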
