Category: Streaming

Message Transformation Within Kafka

Robin Moffatt shows how to use Single Message Transforms inside Kafka Connect to reshape messages as you send them downstream:

Single Message Transforms (SMT) is a functionality within Kafka Connect that enables the transformation … of single messages. Clever naming, right?! Anything that’s more complex, such as aggregating or joining streams of data, should be done with Kafka Streams — but simple transformations can be done within Kafka Connect itself, without needing a single line of code.

SMTs are applied to messages as they flow through Kafka Connect; inbound, they modify the message before it hits Kafka; outbound, the message in Kafka remains untouched but the data landed downstream is modified.
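
For a concrete flavor, here is a minimal sketch of what SMTs look like in a source connector’s configuration. The connector name, field, and topic names are hypothetical, but InsertField and RegexRouter are transforms that ship with Kafka Connect:

    # Hypothetical JDBC source connector with two chained transforms
    name=jdbc-source-example
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:mysql://localhost:3306/demo
    mode=incrementing
    incrementing.column.name=id
    topic.prefix=mysql-
    # SMTs run in the order listed: tag each record, then rename the topic
    transforms=addSource,route
    transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
    transforms.addSource.static.field=source_system
    transforms.addSource.static.value=mysql-demo
    transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
    transforms.route.regex=mysql-(.*)
    transforms.route.replacement=$1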

There’s quite a bit you can do with this, so check it out.

How The New York Times Uses Apache Kafka

Boerge Svingen gives us an architectural overview of how the New York Times uses Apache Kafka to link different services together:

These are all sources of what we call published content. This is content that has been written and edited, and that is considered ready for public consumption.

On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.

This article describes a new approach we developed for solving this problem, based on a log-based architecture powered by Apache Kafka. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API are used to feed published content in real-time to the various applications and systems that make it available to our readers. The new architecture is summarized in the diagram below, and we will deep-dive into the architecture in the remainder of this article.
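
As a rough illustration of the log-as-source-of-truth idea (a sketch, not code from the article; the published-content topic name and plain String values are invented), any downstream service can rebuild its view by replaying the topic from the beginning:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    object RebuildViewFromLog extends App {
      val props = new Properties
      props.put("bootstrap.servers", "localhost:9092")
      props.put("group.id", "search-indexer")
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      // A brand-new consumer group starts at the beginning of the log,
      // so the entire publication history is replayed into this service
      props.put("auto.offset.reset", "earliest")

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(Collections.singletonList("published-content"))

      // Each system (search, personalization, feeds) builds its own view
      // of the world from the same ordered log
      while (true) {
        for (record <- consumer.poll(1000).asScala)
          println(s"asset ${record.key}: ${record.value.take(60)}")
      }
    }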

This is a nice write-up of a real-world use case for Kafka.

Lambda And Kappa Architectures

Michael Verrilli has a post contrasting the Lambda and Kappa data architectures:

Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up-to-the-moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.

One additional benefit to this architecture is that you can replay the same incoming data and produce new views in case code or formula changes.

The biggest drawback of this architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and speed layers. Luckily, with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.
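
The serving-layer idea is easy to sketch: a query merges the complete-but-stale batch view with the fresh-but-partial real-time view. A toy Scala illustration, with invented view contents:

    object LambdaServingSketch extends App {
      // Batch view: complete and high quality, but only as of the last batch run
      val batchView: Map[String, Long] = Map("page-1" -> 1000L)

      // Real-time (speed) view: covers only events since that batch run
      val realtimeView: Map[String, Long] = Map("page-1" -> 7L, "page-2" -> 3L)

      // A query consults both layers to get a complete, current answer
      def pageViews(page: String): Long =
        batchView.getOrElse(page, 0L) + realtimeView.getOrElse(page, 0L)

      println(pageViews("page-1")) // 1007: batch total plus real-time delta
    }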

I haven’t seen much on the topic of Big Data architectures this year; it seems like it was a much more popular topic last year.

KSQL: Streaming SQL For Kafka

Neha Narkhede announces KSQL:

I’m really excited to announce KSQL, a streaming SQL engine for Apache Kafka. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.
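
To give a flavor of the syntax, here is a sketch with an invented pageviews stream (the column and topic names are illustrative, not from the announcement):

    -- Register an existing Kafka topic as a stream
    CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
      WITH (kafka_topic = 'pageviews', value_format = 'JSON');

    -- A continuous query: per-user counts over one-minute tumbling windows
    SELECT userid, COUNT(*)
    FROM pageviews
    WINDOW TUMBLING (SIZE 1 MINUTE)
    GROUP BY userid;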

Feasel’s Law wins again. The syntax looks pretty similar to Spark Streaming and Stream Analytics, so if you get those, you’ll get this.

Learning Spark Structured Streaming

Jules Damji has a nice compendium of links and additional resources for people wanting to learn more about Apache Spark’s Structured Streaming:

Structured Streaming In Apache Spark: A new high-level API for streaming

Databricks’ engineers and Apache Spark committers Matei Zaharia, Tathagata Das, Michael Armbrust and Reynold Xin expound on why streaming applications are difficult to write, and how Structured Streaming addresses all the underlying complexities.
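
If you want the one-screen version before diving into the reading list, the canonical word count (a sketch assuming a socket source on port 9999; it is not taken from the linked posts) shows the shape of the API:

    import org.apache.spark.sql.SparkSession

    object StructuredWordCount extends App {
      val spark = SparkSession.builder
        .appName("StructuredWordCount")
        .master("local[2]")
        .getOrCreate()
      import spark.implicits._

      // An unbounded DataFrame: one row per line arriving on the socket
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", "9999")
        .load()

      // The same operations you would write against a static DataFrame
      val counts = lines.as[String]
        .flatMap(_.split(" "))
        .groupBy("value")
        .count()

      // Continuously emit the updated result table to the console
      counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
        .awaitTermination()
    }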

There’s quite a bit of reading material on the other side.

Kafka Connect To Elasticsearch

Robin Moffatt shows how to take data from Kafka Connect and feed it into Elasticsearch:

Whilst Kafka Connect is part of Apache Kafka itself, if you want to stream data from Kafka to Elasticsearch you’ll want the Confluent Open Source distribution (or at least, the Elasticsearch connector).

The configuration is pretty simple. As before, see inline comments for details:
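
A representative sink configuration (a sketch with invented names, not necessarily the post’s exact settings) looks like this:

    # Hypothetical Elasticsearch sink connector
    name=es-sink-example
    connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
    # Kafka topic(s) to stream into Elasticsearch
    topics=mysql-foobar
    connection.url=http://localhost:9200
    # Document type under which records are indexed
    type.name=kafka-connect
    # Use the Kafka coordinates (topic+partition+offset) as the document ID
    key.ignore=true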

It’s worth noting that if you’re using the same converter throughout your pipelines (Avro, in this case) you’d actually put this in the Connect worker config itself rather than repeating it for each connector configuration.

This is a simple example which shows just how easy it can be.

A Simple Example With Spark And Kafka

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka:

This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.

The following tools will be used:

  • Git—to manage and clone source code

  • Docker—to run some services in containers

  • Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala

  • Maven 3—to compile the code we write

  • Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial

  • Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.

The Hello World of streaming apps is a Twitter client.
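
For flavor, here is a minimal sketch of the Kafka-to-Spark half of such an app, using Spark Streaming’s kafka010 integration (this is not code from the tutorial, and the tweets topic is invented):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.KafkaUtils

    object TweetCounter extends App {
      val conf = new SparkConf().setAppName("TweetCounter").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(5))

      val kafkaParams = Map[String, Object](
        "bootstrap.servers"  -> "localhost:9092",
        "key.deserializer"   -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id"           -> "tweet-counter")

      // Subscribe to a hypothetical topic of raw tweets
      val tweets = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq("tweets"), kafkaParams))

      // Count the tweets in each five-second micro-batch
      tweets.map(_.value).count().print()

      ssc.start()
      ssc.awaitTermination()
    }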

Kafka Connect Done Easy

Robin Moffatt shows how to build a simple Kafka Connect flow:

This is pretty cool – the update_ts column is managed automagically by MySQL (other RDBMS have similar functionality), and Kafka Connect’s JDBC connector is using this to pick out new and updated rows from the database.

As a side note here, Kafka Connect tracks the offset of the data that it has read using the connect-offsets topic. Even if you delete and recreate the connector, if the connector has the same name it will retain the same offsets previously stored. So if you want to start from scratch, you’ll want to change the connector name – for example, use an incrementing suffix for each test version you work with. You can actually check the content of the connect-offsets topic easily:
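
Something along these lines works (exact script name and flags vary a little by Kafka distribution and version):

    kafka-console-consumer \
      --bootstrap-server localhost:9092 \
      --topic connect-offsets \
      --from-beginning \
      --property print.key=true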

This is part 1 of a mini-series, but does show you how to build connections to stream data from MySQL into Kafka and then into a flat file.

Analyzing Twitter Data With Storm In HDInsight

Nischal S shows how to configure an HDInsight cluster to process tweets, followed by loading them into a Power BI dashboard:

When we need to process streams of real-time data, Storm is a great contender. Examples of streaming data include consumer clicks and navigation on a website, IIS or user logs, IoT data, and social network information. In all these scenarios, we use real-time data processing. Apache Storm can process real-time, unbounded streams of data.

The term “unbounded” defines streams of data with no start or end. Here, the processing of data is continuous and in real-time. Twitter is a good example. Twitter data is continuous, has no start or end time, and is provided in real-time by millions of Twitter users around the world.

Storm wouldn’t rank in my top three technologies for doing this, but it certainly does the job.

Unit Testing Kafka Streams

Anuj Saxena shows us how to build mocks for streams in Kafka Streams:

Here, we are using Kafka Streams in our applications. We are done with the implementation but, again, the most important thing left is testing. This blog is about how to test the application we have created. For this, I’ll be taking the sample app I created in my previous blog for both the high-level DSL and the low-level Processor API.

Traditionally, we test our Kafka application with an integration test, for which we need to create a ZooKeeper instance and a real Kafka broker. After that, we need a mock producer and mock consumer for our app to produce the inputs and receive the outputs. That makes it a big hassle just to test our app. For real scenarios and for the actual integration test, this is needed without a doubt.
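
For comparison, newer Kafka releases ship a kafka-streams-test-utils artifact whose TopologyTestDriver takes the broker out of the equation entirely. A minimal sketch (the uppercasing topology and topic names are stand-ins, not the post’s app):

    import java.util.Properties
    import org.apache.kafka.common.serialization.{Serdes, StringDeserializer, StringSerializer}
    import org.apache.kafka.streams.{StreamsBuilder, StreamsConfig, TopologyTestDriver}
    import org.apache.kafka.streams.kstream.ValueMapper
    import org.apache.kafka.streams.test.ConsumerRecordFactory

    object UppercaseTopologySpec extends App {
      // Topology under test: uppercase every value (a stand-in for the real app)
      val builder = new StreamsBuilder
      builder.stream[String, String]("input-topic")
        .mapValues(new ValueMapper[String, String] {
          override def apply(v: String): String = v.toUpperCase
        })
        .to("output-topic")

      val props = new Properties
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test")
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234") // never contacted
      props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
      props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

      // No ZooKeeper and no broker: the driver runs the topology in-process
      val driver  = new TopologyTestDriver(builder.build(), props)
      val factory = new ConsumerRecordFactory[String, String](
        "input-topic", new StringSerializer, new StringSerializer)

      driver.pipeInput(factory.create("input-topic", "key", "hello"))
      val out = driver.readOutput("output-topic", new StringDeserializer, new StringDeserializer)
      assert(out.value == "HELLO")
      driver.close()
    }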

Click through for an example.
