Category: Streaming

The solution uses the following open-source tools. The solution architecture is illustrated below.

Apache Kafka Connect is a tool to stream data between Apache Kafka and other components.

InfluxDB which is a time series database from InfluxData. It will be used to store time series data from Kafka input and output topics.

Influx sink connector from Datamountaineer. It is a connector and sink to write events from Kafka to InfluxDB.

Chronograf is an open-source monitoring solution from InfluxData.

Click through for the solution.

Comments closed

Stream Reactor Update

Published 2018-01-23 by Kevin Feasel

Andrew Stevenson announces Stream Reactor 1.0.0 for Kafka Connect 1.0:

Stream Reactor is an Apache License, Version 2.0 open source collection of components built on top of Kafka and provides Kafka Connect compatible connectors to move data between Kafka and popular data stores. Stream Reactor provides source connectors to publish data into Kafka and sink connectorsto bring data from Kafka into other systems. The connectors support KCQL (Kafka Connect Query Language), an open source component of Lenses SQL Enginethat provides an elegant and simple SQL like syntax for selecting fields and routing from sources or topics to Kafka or the target system (topic to target entity mapping, field selection, auto creation, auto evolution, error policies).

We hope you find Stream Reactor useful, and want to give it a try! Stream Reactor has over 25 connectors available, tested and documented, supporting both Kafka 0.11 and Kafka 1.0 and you can give it a go by downloading Lenses Development Environment or find the jars on GitHub, or even build the code locally and help us improve and add even more connectors.

Read on for more details, as well as a link to the GitHub repo.

Comments closed

Apache NiFi 1.5 Updates

Published 2018-01-17 by Kevin Feasel

Tim Spann shows off some nice additions to Apache NiFi:

Another cool processor that I will talk about in greater detail in future articles is the much-requested Spark Processor. The ExecuteSparkInteractive processor with its Livy Controller Service gives you a much better alternative to my hacky REST integration to calling Apache Spark batch and machine learning jobs.

There are a number of enhancements, new processors, and upgrades I’m excited about, but the main reason I am writing today is because of a new feature that allows for having an Agile SDLC with Apache NiFi. This is now enabled by Apache NiFi Registry. It’s as simple as a quick git clone or download and then, you’ll use Apache Maven to install Apache NiFi Registry and start it. This process will become even easier with future Ambari integration for a CLI-free install.

To integrate the Registry with Apache NiFi, you need to add a Registry Client. It’s very simple to add the default local one — see below.

There are several new features in the latest release.

Comments closed

Streaming Analytics With Kafka

Published 2018-01-05 by Kevin Feasel

Rathnadevi Manivannan shows how to use Kafka SQL to query streaming data:

Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time, enrich the trip data with other station details, and find the number of trips started and ended in a day for a particular station. It is also used to publish trip data from the source to other destinations for further analysis.

In this article, let’s discuss enriching the Citi Bike trip data and finding the number of trips on a particular day to and from a particular station.

Read on for a nice tutorial.

Comments closed

Streaming Performance Counters Into Power BI

Published 2017-12-28 by Kevin Feasel

Chris Koester shows how to load Performance Counters (i.e., what Perfmon displays) into Power BI in near real time:

In the previous post I showed how you can Push Data into Power BI Streaming Datasets with C#. That example used dummy data. In this post I’ll show how to push performance counter data into a Power BI Streaming Dataset as a real world example. This scenario allows for monitoring a computer or application in near real time in the browser.

I won’t go through the steps of creating a Power BI Streaming Dataset. You can reference my previous post if you need instructions. I will note that the value names that you choose in the Streaming Dataset must match the C# property names for the script to work. This is noted in the code comments as well.

Check it out.

Comments closed

Unit Testing Spark Streaming DStreams

Published 2017-12-27 by Kevin Feasel

Anuj Saxena gives an example of using StreamingSuiteBase to build unit tests for DStreams in Spark Streaming:

So what’s the problem? How to execute streaming logic in a test environment.

We can write Integration test cases and provide the actual environment in the integration test. But for unit testing, we need a testing environment which should not depend on any external application.

Click through for the example.

Comments closed

What’s New With KSQL

Published 2017-12-22 by Kevin Feasel

Hojjat Jafarpour announces Kafka’s KSQL version 0.3:

Additionally, we have taken the first steps to provide metrics and observability in KSQL. This greatly enhances the operability of KSQL, like in cases where you’re monitoring KSQL capacity or when diagnosing issues. You can now see different metrics for streams, tables, and queries for every KSQL server instance.

For streams and tables, we now have DESCRIBE EXTENDED <stream/table name> statement to show statistics, such as number of messages processed per second, total messages, the time when the last message was received, as well as corresponding failure metrics.

Looks like they’re building it out a piece at a time.

Comments closed

So You Really Want To Monitor Kafka…

Published 2017-12-18 by Kevin Feasel

Yeva Byzek walks through Confluent Platform:

Kafka exposes hundreds of metrics. Some of them are per broker, per client, per topic, and per partition, and so the number of metrics scales up as the cluster grows. For an average-size Kafka cluster, the number of metrics very quickly bloats to the thousands.

Warning: I am about to disappoint you. You probably recognize that you realistically cannot monitor every single available metric. So you are probably hoping that in this blog post I will filter down the list of metrics to a dozen of the most critical ones, which you would then push through some generic monitoring tool, and then be done with setting up “monitoring.” However, monitoring distributed systems like Kafka is not that simple, and so there is no such list. Keep reading to understand the problems you should be solving, and how to solve them in a robust monitoring solution specifically designed for Kafka.

A common pitfall of generic monitoring tools is that they import all available metrics from a variety of systems into a metrics swamp. Even with a comprehensive list of metrics, there is a limit to what can be achieved with no Kafka context nor Kafka expertise to determine which metrics are important and which ones are not. A metrics swamp cannot produce valuable insight from the data nor provide answers to the critical business questions we asked earlier.

This is an information-dense post that you’ll want to read if you work with Apache Kafka.

Comments closed

Enabling Exactly-Once Kafka Streams

Published 2017-12-15 by Kevin Feasel

Guozhang Wang wraps up his exactly-once series in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message A before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a blackbox and hence does not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above mentioned failure happens.

There’s some good information in here, so check it out.

Comments closed

Kafka Streams And Time-Based Batching

Published 2017-12-11 by Kevin Feasel

Vladimir Vajda provides a warning for people using Kafka Streams:

To completely understand the problem, we will first go into detail how ingestion and processing occur by default in Kafka Streams. For example purposes, the punctuate method is configured to occur every ten seconds, and in the input stream, we have exactly one message per second. The purpose of the job is to parse input messages, collect them, and, in the punctuate method, do a batch insert in the database, then to send metrics.

After running the Kafka Stream application, the Processor will be created, followed by the initmethod. Here is where all the connections are established. Upon successful start, the application will listen to input topic for incoming messages. It will remain idle until the first message arrives. When the first message arrives, the process method is called — this is where transformations occur and where the result is stored for later use. If no messages are in the input topic, the application will go idle again, waiting for the next message. After each successful process, the application checks if punctuate should be called. In our case, we will have ten process calls followed by one punctuate call, with this cycle repeating indefinitely as long as there are messages.

A pretty obvious behavior, isn’t it? Then why is one bolded?

Read on for more, including how to handle this edge case.

Comments closed