Category: Streaming

Using Spark Streaming On Kafka

Published 2017-09-27 by Kevin Feasel

Ayush Tiwari has an introductory tutorial on using Spark Streaming on top of Kafka:

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change.

In this blog, I am going to implement the basic example on Spark Structured Streaming & Kafka Integration.

This is a code-heavy tutorial, so check it out.

Comments closed

Joins With Kafka

Published 2017-09-22 by Kevin Feasel

Florian Trossbach and Matthias J Sax show the various sorts of joins offered in Kafka, both streams and tables:

Apache Kafka’s Streams API provides a very sophisticated API for joins that can handle many use cases in a scalable way. However, some join semantics might be surprising to developers as streaming join semantics differ from SQL semantics. Furthermore, the semantics of changelog streams and tombstone messages (that are used for deletes) are a new concept in stream processing.

Kafka’s journey from Pub/Sub broker to distributed streaming platform is well underway, and our times as engineers are very exciting!

I didn’t know you could join streams together in Kafka, so that’s really cool.

Comments closed

Creating A Simple Kafka Streams Application

Published 2017-09-15 by Kevin Feasel

Bill Bejeck has built a simple Kafka Streams application for us:

This blog post will quickly get you off the ground and show you how Kafka Streams works. We’re going to make a toy application that takes incoming messages and upper-cases the text of those messages, effectively yelling at anyone who reads the message. This application is called the yelling application.

Before diving into the code, let’s take a look at the processing topology you’ll assemble for this “yelling” application. We’ll build a processing graph topology, where each node in the graph has a particular function.

His entire application is 20 lines of code but it does function as a valid Kafka Streams app and works well as a demo.

Comments closed

Message Transformation Within Kafka

Published 2017-09-08 by Kevin Feasel

Robin Moffatt shows how to use Single Message Transforms inside Kafka Connect to reshape messages as you send them downstream:

Single Message Transforms (SMT) is a functionality within Kafka Connect that enables the transformation … of single messages. Clever naming, right?! Anything that’s more complex, such as aggregating or joins streams of data should be done with Kafka Streams — but simple transformations can be done within Kafka Connect itself, without needing a single line of code.

SMTs are applied to messages as they flow through Kafka Connect; inbound it modifies the message before it hits Kafka, outbound and the message in Kafka remains untouched but the data landed downstream is modified.

There’s quite a bit you can do with this, so check it out.

Comments closed

How The New York Times Uses Apache Kafka

Published 2017-09-07 by Kevin Feasel

Boerge Svingen gives us an architectural overview of how the New York Times uses Apache Kafka to link different services together:

These are all sources of what we call published content. This is content that has been written, edited, and that is considered ready for public consumption.

On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.

This article describes a new approach we developed to solving this problem, based on a log-based architecture powered by Apache Kafka^TM. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API is used to feed published content in real-time to the various applications and systems that make it available to our readers. The new architecture is summarized in the diagram below, and we will deep-dive into the architecture in the remainder of this article.

This is a nice write-up of a real-world use case for Kafka.

Comments closed

Lambda And Kappa Architectures

Published 2017-08-31 by Kevin Feasel

Michael Verrilli has a post contrasting the Lambda and Kappa data architectures:

Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up to the moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.

One additional benefit to this architecture is that you can replay the same incoming data and produce new views in case code or formula changes.

The biggest detraction to this architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and speed layers. Luckily with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.

I haven’t seen much on the topic of Big Data architectures this year; it seems like it was a much more popular topic last year.

Comments closed

KSQL: Streaming SQL For Kafka

Published 2017-08-29 by Kevin Feasel

Neha Narkhende announces KSQL:

I’m really excited to announce KSQL, a streaming SQL engine for Apache Kafka^TM. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.

Feasel’s Law wins again. The syntax looks pretty similar to Spark Streaming and Stream Analytics, so if you get those, you’ll get this.

Comments closed

Learning Spark Structured Streaming

Published 2017-08-28 by Kevin Feasel

Jules Damji has a nice compendium of links and additional resources for people wanting to learn more about Apache Spark’s Structured Streaming:

Structured Streaming In Apache Spark: A new high-level API for streaming

Databricks’ engineers and Apache Spark committers Matei Zaharia, Tathagata Das, Michael Armbrust and Reynold Xin expound on why streaming applications are difficult to write, and how Structured Streaming addresses all the underlying complexities.

There’s quite a bit of reading material on the other side.

Comments closed

Kafka Connect To Elasticsearch

Published 2017-08-24 by Kevin Feasel

Robin Moffatt shows how to take data from Kafka Connect and feed it into Elasticsearch:

Whilst Kafka Connect is part of Apache Kafka itself, if you want to stream data from Kafka to Elasticsearch you’ll want the Confluent Open Source distribution (or at least, the Elasticsearch connector).

The configuration is pretty simple. As before, see inline comments for details

It’s worth noting that if you’re using the same convertor throughout your pipelines (Avro, in this case) you’d actually put this in the Connect worker config itself rather than repeating it for each connector configuration.

This is a simple example which shows just how easy it can be.

Comments closed

A Simple Example With Spark And Kafka

Published 2017-08-18 by Kevin Feasel

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka:

This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.

The following tools will be used:

Git—to manage and clone source code
Docker—to run some services in containers
Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala
Maven 3—to compile the code we write
Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial
Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.

The Hello World of streaming apps is a Twitter client.

Comments closed