Category: Streaming

With Confluent Platform 5.0, operators can secure infrastructure using the new, easy-to-use LDAP authorizer plugin and can deliver faster disaster recovery (DR) thanks to automatic offset translation in Confluent Replicator. In Confluent Control Center, operators can now view broker configurations and inspect consumer lag to ensure that they are getting the most out of Kafka and that applications are performing as expected.

We have also introduced advanced capabilities for developers. In Confluent Control Center, developers can now better understand the data in Kafka topics due to the new topic inspection feature and Confluent Schema Registry integration. Control Center presents a new graphical user interface (GUI) for writing KSQL, making stream processing more effortless and intuitive as well. The latest version of KSQL itself introduces exciting additions, such as support for nested data, user-defined functions (UDFs), new types of joins and an enhanced REST API. Furthermore, Confluent Platform 5.0 includes the new Confluent MQTT Proxy for easier Internet of Things (IoT) integration with Kafka. The latest release is built on Apache Kafka 2.0, which features several new functionalities and performance improvements.

Looks like there have been some nice incremental improvements here.


Ingesting Multiple Data Sources With NiFi And MiniFi

Published 2018-07-31 by Kevin Feasel

Tim Spann shows how to collect data from multiple IoT devices using MiniFi and send it to a NiFi host:

So I designed my MiniFi flow in the Apache NiFi UI (pretty soon there will be a special designer for this). You then highlight everything there and hit ‘Create Template.’ You can then export it and convert it to config.yml. Again, this process will be automated and connected with the NiFi Registry very shortly to reduce the amount of clicking.

This is an example. When you design your flow in the Apache NiFi UI, you will connect to this port on the Remote Process Group. If you are manually editing one (okay, never do this, but sometimes I have to), you can copy that ID from this Port Details and paste it in the file.

I like this as an overview of NiFi’s capabilities and a sneak peek at where they’re going.


Enriching Syslog Data In A Kafka Pipeline

Published 2018-06-28 by Kevin Feasel

Robin Moffatt continues his syslog processing series with Kafka and KSQL:

In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.
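The enrichment step described here is essentially a lookup join: each inbound event is matched against a reference table by some identifier. Below is a minimal plain-Python sketch of that idea. It does not use Kafka or KSQL, and the field names (`host`, `message`) and the in-memory `DEVICES` table standing in for the MongoDB reference data are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference data; in the article this lives in MongoDB.
DEVICES: Dict[str, Dict[str, str]] = {
    "10.0.0.5": {"name": "office-ap", "type": "access point"},
    "10.0.0.9": {"name": "core-switch", "type": "switch"},
}


def enrich(events: Iterable[dict], reference: Dict[str, Dict[str, str]]) -> Iterator[dict]:
    """Left-join each event to the reference table on its 'host' field."""
    for event in events:
        device = reference.get(event["host"], {})
        # Unknown hosts pass through with None values instead of being dropped.
        yield {
            **event,
            "device_name": device.get("name"),
            "device_type": device.get("type"),
        }


syslog_events = [
    {"host": "10.0.0.5", "message": "client associated"},
    {"host": "10.0.0.77", "message": "login failed"},
]
enriched = list(enrich(syslog_events, DEVICES))
for row in enriched:
    print(row)
```

Keeping unmatched events (a left join) is a design choice in this sketch; an inner join would silently drop messages from devices missing in the reference data.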

I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams. If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.


Visualizing Data In Real Time With SQL Server And Dash

Published 2018-06-19 by Kevin Feasel

Tomaz Kastrun shows how to use Python Dash to visualize data living in SQL Server in real time:

The need for visualizing real-time (or near-real-time) data has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities to visualize streaming data, and this time I will tackle this issue using Python and the Python Dash package for building web applications and visualizations. Dash is built on top of Flask, React and Plotly and gives a wide range of capabilities to create interactive web applications, interfaces and visualizations.

Tomaz’s example hits SQL Server every half-second to grab the latest changes and gives us an example of roll-your-own streaming.
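The polling pattern itself is small enough to sketch. The following is not Tomaz's code and does not use Dash. It assumes a hypothetical `readings` table with an increasing `id` column, and it uses Python's built-in `sqlite3` module as a stand-in for a SQL Server connection so that the sketch runs on its own.

```python
import sqlite3
import time


def fetch_new_rows(conn, last_id):
    """Return rows with id greater than last_id, oldest first."""
    cur = conn.execute(
        "SELECT id, value FROM readings WHERE id > ? ORDER BY id", (last_id,)
    )
    return cur.fetchall()


def poll(conn, polls, interval_seconds=0.5, on_rows=print):
    """Poll a fixed number of times, handing each new batch to on_rows."""
    last_id = 0
    for _ in range(polls):
        rows = fetch_new_rows(conn, last_id)
        if rows:
            last_id = rows[-1][0]  # remember the high-water mark
            on_rows(rows)
        time.sleep(interval_seconds)
    return last_id


# Hypothetical table and data, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO readings (value) VALUES (?)", [(1.5,), (2.5,)])
conn.commit()

batches = []
last_seen = poll(conn, polls=2, interval_seconds=0.01, on_rows=batches.append)
print(batches, last_seen)
```

Tracking a high-water mark keeps each poll cheap, though it only works when new rows always receive a larger key than existing ones.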


Understanding A Spark Streaming Workflow

Published 2018-06-18 by Kevin Feasel

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode”. It is nothing but one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, let’s read about them in detail:

Complete Mode – This mode updates the entire Result Table which is eventually written to the sink.
Append Mode – In this mode, only the new rows appended to the Result Table since the last trigger are sent to the sink.
Update Mode – This mode sends to the sink only the rows in the Result Table that have changed since the last trigger, which is how it differs from Complete Mode. If the query doesn’t contain any aggregations, it is equivalent to Append Mode.
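To make the difference between Complete and Update modes concrete, here is a plain-Python model of a running word count that records what each trigger would emit. It is a sketch of the semantics only, not the Spark API, and the function name `run_word_count` is hypothetical. Append mode is left out because, as far as I know, Spark does not allow it for a running aggregation like this one unless a watermark is defined.

```python
from collections import Counter


def run_word_count(batches, mode):
    """Simulate a running word count and return what each trigger emits.

    mode is 'complete' or 'update'. This is a plain-Python model of the
    semantics, not the Spark API.
    """
    counts = Counter()
    emitted = []
    for batch in batches:
        changed = set()
        for line in batch:
            for word in line.split():
                counts[word] += 1
                changed.add(word)
        if mode == "complete":
            # The whole Result Table goes to the sink on every trigger.
            emitted.append(dict(counts))
        elif mode == "update":
            # Only rows whose count changed in this trigger go to the sink.
            emitted.append({w: counts[w] for w in changed})
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return emitted


batches = [["cat dog"], ["dog owl"]]
complete_out = run_word_count(batches, "complete")
update_out = run_word_count(batches, "update")
print(complete_out)
print(update_out)
```

After the second trigger, the Complete output still carries the unchanged `cat` row, while the Update output contains only `dog` and `owl`.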

Check it out.


The Basics Of Azure Stream Analytics

Published 2018-06-06 by Kevin Feasel

Chris Seferlis gives us an overview of Azure Stream Analytics:

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.

From there, when you identify the data that you want/need to use, you can then send that data downstream to a queue for triggering workflows or further processing. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Chris also has a video which you can watch.


Using Kafka To Go From Batch To Stream

Published 2018-06-05 by Kevin Feasel

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka. Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.
Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.
New valid reviews are published to another database table (which contains all the historic valid reviews).
Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.
The website displays these metrics through a REST API when the user navigates the website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams, to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations, such as aggregations and windowing, in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.
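The two ideas in these excerpts, splitting a stream with a spam function and folding the valid reviews into a table that emits each update, can be modelled in a few lines of plain Python. This sketch does not use the Kafka Streams DSL. The spam rule, the field names (`course`, `rating`, `text`) and the function names are all hypothetical.

```python
from collections import defaultdict


def is_spam(review):
    """Hypothetical spam rule: treat reviews with empty text as spam."""
    return not review["text"].strip()


def split_reviews(reviews):
    """Split one stream into (valid, spam) lists, like branching a KStream."""
    valid, spam = [], []
    for review in reviews:
        (spam if is_spam(review) else valid).append(review)
    return valid, spam


def aggregate_ratings(valid_reviews):
    """Fold reviews into a per-course table, yielding each table update."""
    table = defaultdict(lambda: {"count": 0, "total": 0})
    for review in valid_reviews:
        row = table[review["course"]]
        row["count"] += 1
        row["total"] += review["rating"]
        # Emit the updated row downstream, as a KTable forwards its changes.
        yield review["course"], {
            "count": row["count"],
            "average": row["total"] / row["count"],
        }


reviews = [
    {"course": "kafka-101", "rating": 5, "text": "Great course"},
    {"course": "kafka-101", "rating": 1, "text": "   "},
    {"course": "kafka-101", "rating": 4, "text": "Solid"},
]
valid, spam = split_reviews(reviews)
updates = list(aggregate_ratings(valid))
print(updates)
```

Each yielded pair is the latest state of one row, which is the sense in which a table can be read as a changelog of updates per key.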

This is a nice introduction to Kafka Streams using a realistic example.


A SQL Client For Apache Flink

Published 2018-05-30 by Kevin Feasel

Alex Woodie points out that Apache Flink now has a SQL client built in:

Apache Flink has contained SQL functionality since Flink version 1.1, which introduced a SQL API based on Apache Calcite and a table API, too. While the combined SQL and Table API today provides valuable ways for developers to apply well-understood relational data and SQL constructs to the world of stream data processing, its usefulness is somewhat limited.

For starters, only Scala and Java experts can avail themselves of the API, according to the description of the new SQL client, which is codenamed FLIP-24. What’s more, any table program that was written with the SQL and Table API had to be packaged with Apache Maven, a Java-based project management tool, and submitted to the Flink cluster before running.

With the launch of the SQL CLI Client in Flink version 1.5, the Flink community is taking its support for SQL in a new direction. According to the FLIP-24 project page, providing an interactive shell will not only make Flink accessible to non-programmers, including data scientists, but it will also eliminate the need for a full IDE to program Flink apps. With millions of SQL-loving data analysts out there, the benefits could certainly be vast.

Good stuff. Feasel’s Law in action.


Stream-To-Stream Joins In Spark

Published 2018-05-25 by Kevin Feasel

Ayush Tiwari shows how to join a pair of streams in Apache Spark 2.3:

Spark 2.3 added support for stream-stream joins, i.e., we can join two streaming Datasets/DataFrames. In this blog we are going to see how beautifully Spark now supports joining two streaming DataFrames.

In this example, I am going to use:
Apache Spark 2.3.0
Apache Kafka 0.11.0.1
Scala 2.11.8
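Conceptually, a stream-stream join has to buffer events from both sides, because a match for any event may arrive later on the other stream. The plain-Python sketch below models an inner join on a key with a time bound. It is a conceptual model only, not Spark's implementation or API, and the event layout, the `max_gap` parameter and the sample data are assumptions.

```python
def stream_stream_join(left, right, max_gap):
    """Inner-join two event lists on key when timestamps differ by <= max_gap.

    Events are (timestamp, key, value) tuples. Both inputs are buffered,
    which mirrors why a streaming engine must keep state for each side.
    """
    # Tag each event with its side and process everything in timestamp order.
    tagged = [(ts, "L", key, val) for ts, key, val in left]
    tagged += [(ts, "R", key, val) for ts, key, val in right]
    tagged.sort(key=lambda e: e[0])

    buffers = {"L": [], "R": []}
    joined = []
    for ts, side, key, val in tagged:
        other = "R" if side == "L" else "L"
        # Evict buffered events that are too old to match anything new.
        buffers[other] = [e for e in buffers[other] if ts - e[0] <= max_gap]
        for o_ts, o_key, o_val in buffers[other]:
            if o_key == key:
                pair = (val, o_val) if side == "L" else (o_val, val)
                joined.append((key, *pair))
        buffers[side].append((ts, key, val))
    return joined


impressions = [(1, "ad1", "shown"), (2, "ad2", "shown")]
clicks = [(3, "ad1", "clicked"), (20, "ad2", "clicked")]
result = stream_stream_join(impressions, clicks, max_gap=10)
print(result)
```

The eviction step is what keeps the buffers bounded; without a time bound, both sides would have to be retained indefinitely.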

Click through for the demo.


Visualization Over Kafka And KSQL

Published 2018-05-24 by Kevin Feasel

Shant Hovsepian shows off a data visualization tool which can read Kafka Streams data:

KSQL is a game-changer not only for application developers but also for non-technical business users. How? The SQL interface opens up access to Kafka data to analytics platforms based on SQL. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytical skills to Kafka. So instead of continually building new analytics outputs due to evolving business requirements, IT teams can hand a comprehensive analytics interface directly to the business analysts. Analysts get a self-service environment where they can independently build dashboards and applications.

Arcadia Data is a Confluent partner that is leading the charge for integrating visual analytics and BI technology directly with KSQL. We’ve been working to combine our existing analytics stack with KSQL to provide a platform that requires no complicated new skills for your analysts to visualize streaming data. Just as they will create semantic layers, build dashboards, and deploy analytical applications on batch data, they can now do the same on streaming data. Real-time analytics and visualizations for business users have largely been a misnomer until now. For example, some architectures enabled visualizations for end users by staging Kafka data into a separate data store, which added latency. KSQL removes that latency to let business users see the most recent data directly in Kafka and react immediately.

Click through for a couple repos and demos.

