Press "Enter" to skip to content

Category: Streaming

Databricks Delta: Data Skipping And ZORDER Clustering

Adrian Ionescu explains a couple of concepts which can help make selective queries with Databricks much faster:

The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context due to scalability reasons.

If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.

Quick reminder: In Spark, just like Hive, partitioning works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.

The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?

Read the whole thing.

Comments closed

Combining Apache Kafka With TensorFlow

Kai Waehner has an example of an application which uses Apache Kafka to stream car sensor data to TensorFlow on Google ML Engine:

A great benefit of Confluent MQTT Proxy is simplicity for realizing IoT scenarios without the need for a MQTT Broker. You can forward messages directly from the MQTT devices to Kafka via the MQTT Proxy. This reduces efforts and costs significantly. This is a perfect solution if you “just” want to communicate between Kafka and MQTT devices.

If you want to see the other part of the story (integration with sink applications like Elasticsearch / Grafana), please take a look at the Github project “KSQL for streaming IoT data“. This realizes the integration with ElasticSearch and Grafana via Kafka Connect and the Elastic connector.

Check it out and then take a gander at Kai’s GitHub repo.

Comments closed

Joining Streams Of Data

Chuck Blake gives an example of joining two streams of data together in Wallaroo:

The joining event streams pattern takes multiple data pipelines and joins them to produce a new signal message that can be acted upon by a later process.

This pattern can is used in a variety of use cases. Here are a few examples:

  • Merging data for an individual across a variety of social media accounts.

  • Merging click data from a variety of devices (e.g. mobile and desktop) for an individual user.

  • Tracking locations of delivery vehicles and assets that need to be delivered.

  • Monitoring electronic trading activity for clients on a variety of trading venues.

Conceptually, it’s very similar to normal join operations, but there is a time element which complicates things.

Comments closed

Confluent Platform 5.0 Released

Raj Jain and Michael Noll walk through the latest version of Confluent Platform, Confluent’s Kafka solution:

With Confluent Platform 5.0, operators can secure infrastructure using the new, easy-to-use LDAP authorizer plugin and can deliver faster disaster recovery (DR) thanks to automatic offset translation in Confluent Replicator. In Confluent Control Center, operators can now view broker configurations and inspect consumer lag to ensure that they are getting the most out of Kafka and that applications are performing as expected.

We have also introduced advanced capabilities for developers. In Confluent Control Center, developers can now better understand the data in Kafka topics due to the new topic inspection feature and Confluent Schema Registry integration. Control Center presents a new graphical user interface (GUI) for writing KSQL, making stream processing more effortless and intuitive as well. The latest version of KSQL itself introduces exciting additions, such as support for nested data, user-defined functions (UDFs), new types of joins and an enhanced REST API. Furthermore, Confluent Platform 5.0 includes the new Confluent MQTT Proxy for easier Internet of Things (IoT) integration with Kafka. The latest release is built on Apache Kafka 2.0, which features several new functionalities and performance improvements.

Looks like there have been some nice incremental improvements here.

Comments closed

Ingesting Multiple Data Sources With NiFi And MiniFi

Tim Spann shows how to collect data from multiple IoT devices using MiniFi and send it to a NiFi host:

So I designed my MiniFi flow in the Apache NiFi UI (pretty soon there will be a special designer for this). You then highlight everything there and hit ‘Create Template.’ You can then export it and convert it to config.yml. Again, this process will be automated and connected with the NiFi Registry very shortly to reduce the amount of clicking.

This is an example. When you connect to it in your flow you design it in Apache NiFi UI, you will connect to this port on the Remote Processor Group. If you are manually editing one (okay never do this, but sometimes I have to), you can copy that ID from this Port Details and past it in the file.

I like this as an overview of NiFi’s capabilities and a sneak peek at where they’re going.

Comments closed

Enriching Syslog Data In A Kafka Pipeline

Robin Moffatt continues his syslog processing series with Kafka and KSQL:

In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.

I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams.  If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.

Comments closed

Visualizing Data In Real Time With SQL Server And Dash

Tomaz Kastrun shows how to use Python Dash to visualize data living in SQL Server in real time:

The need for visualizing the real-time data (or near-real time) has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities to visualize streaming data and this time, I will tackle this issue using Python. And python Dash package  for building web applications and visualizations. Dash is build on top of the Flask, React and Plotly and give the wide range of capabilities to create a interactive web applications, interfaces and visualizations.

Tomaz’s example hit SQL Server every half-second to grab the latest changes and gives us an example of roll-your-own streaming.

Comments closed

Understanding A Spark Streaming Workflow

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode“. It is nothing but one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, so, let’s read about them in detail:

  1. Complete Mode – This mode updates the entire Result Table which is eventually written to the sink.

  2. Append Mode – In this mode, only the new rows are appended in the Result Table and eventually sent to the sink.

  3. Update Mode – At last, this mode updates only the rows that are changed in the Result Table since the last trigger. Also, only the new rows are sent to the sink. There is one peculiar thing to note about this mode, i.e., it is different from the Complete Mode in the way that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain any aggregations, it is equivalent to the Append mode.

Check it out.

Comments closed

The Basics Of Azure Stream Analytics

Chris Seferlis gives us an overview of Azure Stream Analytics:

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.

From there, when you identify the data that you want/need to use, you can then send that data downstream to be sent to a queue for triggering workflows or further processing of the data. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Chris also has a video which you can watch.

Comments closed

Using Kafka To Go From Batch To Stream

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka.  Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

  • When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.

  • Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.

  • New valid reviews are published to another database table (which contains all the historic valid reviews).

  • Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.

  • The website displays these metrics through a REST API when the user navigates a website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams, to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations such as aggregations, windowing in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.

This is a nice introduction to Kafka Streams using a realistic example.

Comments closed