Press "Enter" to skip to content

Category: Streaming

The Basics Of Azure Stream Analytics

Chris Seferlis gives us an overview of Azure Stream Analytics:

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.

From there, when you identify the data that you want/need to use, you can then send that data downstream to be sent to a queue for triggering workflows or further processing of the data. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Chris also has a video which you can watch.

Comments closed

Using Kafka To Go From Batch To Stream

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka.  Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

  • When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.

  • Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.

  • New valid reviews are published to another database table (which contains all the historic valid reviews).

  • Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.

  • The website displays these metrics through a REST API when the user navigates a website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams, to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations such as aggregations, windowing in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.

This is a nice introduction to Kafka Streams using a realistic example.

Comments closed

A SQL Client For Apache Flink

Alex Woodie points out that Apache Flink now has a SQL client built in:

Apache Flink has contained SQL functionality since Flink version 1.1, which introduced a SQL API based on Apache Calcite and a table API, too. While the combined SQL and Table API today provides valuable ways for developers to apply well-understood relational data and SQL constructs to the world of stream data processing, its usefulness is somewhat limited.

For starters, only Scala and Java experts can avail themselves of API, according to the description of the new SQL client, which is codenamed FLIP-24. What’s more, any table program that was written with the SQL and Table API had to be packaged with Apache Maven, a Java-based project management tool, and submitted to the Flink cluster before running.

With the launch of the SQL CLI Client in Flink version 1.5, the Flink community is taking its support for SQL in a new direction. According to the FLIP-24 project page, providing an interactive shell will not only make Flink accessible to non-programmers, including data scientists, but it will also eliminate the need for a full IDE to program Flink apps. With millions of SQL-loving data analysts out there, the benefits could certainly be vast.

Good stuff.  Feasel’s Law in action.

Comments closed

Visualization Over Kafka And KSQL

Shant Hovsepian shows off a data visualization tool which can read Kafka Streams data:

KSQL is a game-changer not only for application developers but also for non-technical business users. How? The SQL interface opens up access to Kafka data to analytics platforms based on SQL. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytical skills to Kafka. So instead of continually building new analytics outputs due to evolving business requirements, IT teams can hand a comprehensive analytics interface directly to the business analysts. Analysts get a self-service environment where they can independently build dashboards and applications.

Arcadia Data is a Confluent partner that is leading the charge for integrating visual analytics and BI technology directly with KSQL. We’ve been working to combine our existing analytics stack with KSQL to provide a platform that requires no complicated new skills for your analysts to visualize streaming data. Just as they will create semantic layers, build dashboards, and deploy analytical applications on batch data, they can now do the same on streaming data. Real-time analytics and visualizations for business users have largely been a misnomer until now. For example, some architectures enabled visualizations for end users by staging Kafka data into a separate data store, which added latency. KSQL removes that latency to let business users see the most recent data directly in Kafka and react immediately.

Click through for a couple repos and demos.

Comments closed

Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams:

Now we can query from it and show the aggregate window timestamp alongside the result:

ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \
HOST, INVALID_LOGIN_COUNT \
FROM INVALID_USERS_LOGINS_PER_HOST;
1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1
1521646620000 | 2018-03-21 15:37:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:18:00 | rpi-03 | 1
1521649260000 | 2018-03-21 16:21:00 | rpi-03 | 4
1521649320000 | 2018-03-21 16:22:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:38:00 | rpi-03 | 2

In the above query I’m displaying the aggregate window start time, ROWTIME (which is epoch), and converting it also to a display string, using TIMESTAMPTOSTRING. We can use this to easily query the stream for a given window of interest. For example, for the window beginning at 2018-03-21 16:21:00 we can see there were four invalid user login attempts. We can easily check the source data for this, using the ROWTIME in the above output for the window (16:21 – 16:22) as the bounds for the predicate:

It’s a very interesting use case.

Comments closed

Spark Architecture: The Spark Streaming Receiver

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming:

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of  spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Comments closed

Filtering On Kafka Streams

Robin Moffatt has a new series showing how to use Kafka Streams for dealing with syslog data:

syslog is one of those ubiquitous standards on which much of modern computing runs. Built into operating systems such as Linux, it’s also commonplace in networking and IoT devices like IP cameras. It provides a way for streaming log messages, along with metadata such as the source host, severity of the message, and so on. Sometimes the target is simply a local logfile, but more often it’s a centralised syslog server which in turn may log or process the messages further.

As a high-performance, distributed streaming platform, Apache Kafka® is a great tool for centralised ingestion of syslog data. Since Apache Kafka also persists data and supports native stream processing we don’t need to land it elsewhere before we can utilise the data. You can stream syslog data into Kafka in a variety of ways, including through Kafka Connect for which there is a dedicated syslog plugin.

In this post, we’re going to see how KSQL can be used to process syslog messages as they arrive in realtime.

Check it out.

Comments closed

Continuous Processing Mode With Spark Structured Streaming

Joseph Torres, et al, explain how continuous processing mode works with Apache Spark 2.3’s structured streaming:

Suppose we want to build a real-time pipeline to flag fraudulent credit card transactions. Ideally, we want to identify and deny a fraudulent transaction as soon as the culprit has swiped his/her credit card. However, we don’t want to delay legitimate transactions as that would annoy customers. This leads to a strict upper bound on the end-to-end processing latency of our pipeline. Given that there are other delays in transit, the pipeline must process each transaction within 10-20 ms.

Let’s try to build this pipeline in Structured Streaming. Assume that we have a user-defined function “isPaymentFlagged” that can identify the fraudulent transactions. To minimize the latency, we’ll use a 0 second processing time trigger indicating that Spark should start each micro batch as fast as it can with no delays.

They also explain how this newer model differs from the prior model of collecting events in microbatches.

Comments closed

Joining Multiple Types Of Data With KSQL

Robin Moffatt has an example where he enriches streaming CSV data with information stored in MySQL:

This is a continuous query that executes in the background until explicitly terminated by the user. In effect, these are stream processing applications, and all we need to create them is SQL! Here all we’ve done is an enrichment (joining two sets of data), but we could easily add predicates to the data (simply include a WHERE clause), or even aggregations.

You can see which queries are running with the SHOW QUERIES; statement. All queries will pause if the KSQL server stops, and restart automagically when the KSQL server starts again.

The DESCRIBE EXTENDED command can be used to see information about the derived stream such as the one created above. As well as simply the columns involved, we can see information about the underlying topic, and run-time stats such as the number of messages processed and the timestamp of the most recent one.

It’s pretty easy to do; click through to see just how easy.

Comments closed