Press "Enter" to skip to content

Category: Streaming

User-Defined Functions In KSQL

Kai Waehner demonstrates building a user-defined function for Kafka Streams:

As you can see, the full implementation is just a few lines of Java code. In general, you need to implement the logic between receiving input and returning output of the UDF in the evaluate()method. You also need to implement exception handling (e.g. invalid input arguments) where applicable. The init() method is empty in this case, but could initialise any required object instances.

Note that this UDF has state: dateFormat can be null or already initialized. However, no worries. You do not have to manage the scope as Kafka Streams (and therefore KSQL) threads are independent of each other. So this won’t cause any issues.

Click through for the entire process.

Comments closed

Hortonworks DataFlow 3.1 Released

George Vetticaden and Haimo Liu announce Hortonworks DataFlow version 3.1:

Apache Kafka 1.0 support with full integration with HDF Services – Kafka 1.0 provides important new features including more stringent message processing semantics with support for message headers and transactions, performance improvements and advanced security options.

  • Apache Ambari support for Kafka 1.0 – Install, configure, manage, upgrade, monitor, and secure Kafka 1.0 clusters with Ambari.

  • Apache Ranger support for Kafka 1.0 – Manage access control policies (ACLs) using resource or tag-based security for Kafka 1.0 clusters.

  • New NiFi and SAM processors for Kafka 1.0 – New processors in NiFi and Hortonworks Streaming Analytics Manager (SAM) support Kafka 1.0 features including message headers and transactions.

Click through for the list of top changes.

Comments closed

KSQL 0.4 Released

Apurva Mehta announces the release of KSQL 0.4:

The SHOW TOPICS command has been enhanced to include the number of active consumers and also the number of active consumer groups which are reading the topics.

Consumer groups are a feature of Apache Kafka which enable multiple consumer processes to divide the work of consuming Kafka topic. You can learn more about them in the Kafka Consumer JavaDocs, and of course you should read the SHOW TOPICS documentation for more information.

Read on for the full set of changes.

Comments closed

Subtle Changes In Application Behavior Across Kafka Streams Versions

Aleksandar Pejakovic shows some subtle but important changes to an application running Kafka Streams 0.11 versus 1.0:

One would expect that by changing the version, the previous behavior would remain the same. Well, it hasn’t. What has changed?

After each process method, a punctuate method is called. After punctuateInterval is scheduled, punctuate also occurs. This means the following:

  • In the first test scenario, each “Arrived: message_<offset>” message in the console is accompanied with “Punctuate call”. Unsurprisingly, we have one: “Processed: 1” message in output topic. After ten messages, we have another: “Punctuate call” and “Processed: 0” pair.
  • In the second scenario, we have nine: “Arrived: message_<offset>” and “Punctuate call” pairs on the console, followed with 9: “Processed: 1” in the output topic. After the pause and tenth message we have: “Arrived: message_<offset>” and 3 “Punctuate call”. In the output topic, we see “Processed: 1”, “Processed: 0”, and “Processed 0”.

Read the whole thing.  This sort of behavioral change can be hard to suss out when testing a streaming application.

Comments closed

Monitoring Kafka Streaming Pipelines

Randhir Singh shows how to use open-source tools to monitor Kafka streaming pipelines:

The solution uses the following open-source tools. The solution architecture is illustrated below.

  • Apache Kafka Connect is a tool to stream data between Apache Kafka and other components.
  • InfluxDB which is a time series database from InfluxData. It will be used to store time series data from Kafka input and output topics.
  • Influx sink connector from Datamountaineer. It is a connector and sink to write events from Kafka to InfluxDB.
  • Chronograf is an open-source monitoring solution from InfluxData.

Click through for the solution.

Comments closed

Stream Reactor Update

Andrew Stevenson announces Stream Reactor 1.0.0 for Kafka Connect 1.0:

Stream Reactor is an Apache License, Version 2.0 open source collection of components built on top of Kafka and provides Kafka Connect compatible connectors to move data between Kafka and popular data stores. Stream Reactor provides source connectors to publish data into Kafka and sink connectorsto bring data from Kafka into other systems. The connectors support KCQL (Kafka Connect Query Language), an open source component of Lenses SQL Enginethat provides an elegant and simple SQL like syntax for selecting fields and routing from sources or topics to Kafka or the target system (topic to target entity mapping, field selection, auto creation, auto evolution, error policies).

We hope you find Stream Reactor useful, and want to give it a try! Stream Reactor has over 25 connectors available, tested and documented, supporting both Kafka 0.11 and Kafka 1.0 and you can give it a go by downloading Lenses Development Environment or find the jars on GitHub, or even build the code locally and help us improve and add even more connectors.

Read on for more details, as well as a link to the GitHub repo.

Comments closed

Apache NiFi 1.5 Updates

Tim Spann shows off some nice additions to Apache NiFi:

Another cool processor that I will talk about in greater detail in future articles is the much-requested Spark Processor. The ExecuteSparkInteractive processor with its Livy Controller Service gives you a much better alternative to my hacky REST integration to calling Apache Spark batch and machine learning jobs.

There are a number of enhancements, new processors, and upgrades I’m excited about, but the main reason I am writing today is because of a new feature that allows for having an Agile SDLC with Apache NiFi. This is now enabled by Apache NiFi Registry. It’s as simple as a quick git clone or download and then, you’ll use Apache Maven to install Apache NiFi Registry and start it. This process will become even easier with future Ambari integration for a CLI-free install.

To integrate the Registry with Apache NiFi, you need to add a Registry Client. It’s very simple to add the default local one — see below.

There are several new features in the latest release.

Comments closed

Streaming Analytics With Kafka

Rathnadevi Manivannan shows how to use Kafka SQL to query streaming data:

Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time, enrich the trip data with other station details, and find the number of trips started and ended in a day for a particular station. It is also used to publish trip data from the source to other destinations for further analysis.

In this article, let’s discuss enriching the Citi Bike trip data and finding the number of trips on a particular day to and from a particular station.

Read on for a nice tutorial.

Comments closed

Streaming Performance Counters Into Power BI

Chris Koester shows how to load Performance Counters (i.e., what Perfmon displays) into Power BI in near real time:

In the previous post I showed how you can Push Data into Power BI Streaming Datasets with C#. That example used dummy data. In this post I’ll show how to push performance counter data into a Power BI Streaming Dataset as a real world example. This scenario allows for monitoring a computer or application in near real time in the browser.

I won’t go through the steps of creating a Power BI Streaming Dataset. You can reference my previous post if you need instructions. I will note that the value names that you choose in the Streaming Dataset must match the C# property names for the script to work. This is noted in the code comments as well.

Check it out.

Comments closed

Unit Testing Spark Streaming DStreams

Anuj Saxena gives an example of using StreamingSuiteBase to build unit tests for DStreams in Spark Streaming:

So what’s the problem? How to execute streaming logic in a test environment.

We can write Integration test cases and provide the actual environment in the integration test. But for unit testing, we need a testing environment which should not depend on any external application.

Click through for the example.

Comments closed