In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept.
- Extract: Data is available from a source system
- Transform: We want to filter, cleanse or otherwise enrich this source data
- Load: Make the data available to another application
There are two key concepts here:
- Data is created by an application, and we want it to be available to other applications
- We often want to process the data (for example, cleanse and apply business logic to it) before it is used
Thinking about many applications being built nowadays, particularly in the microservices and event-driven space, we recognize that what they do is take data from one or more systems, manipulate it and then pass it on to another application or system. For example, a fraud detection service will take data from merchant transactions, apply a fraud detection model and write the results to a store such as Elasticsearch for review by an expert. Can you spot the similarity to the above outline? Is this a microservice or ETL process?
Things like this are reason #1 why I expect data platform jobs (administrator and developer) to be around decades from now. The set of tools expand, but the nature of the job remains similar.
In simple terms, a Kalman filter is a theoretical model to predict the state of a dynamic system under measurement noise. Originally developed in the 1960s, the Kalman filter has found applications in many different fields of technology including vehicle guidance and control, signal processing, transportation, analysis of economic data, and human health state monitoring, to name a few (see the Kalman filter Wikipedia page for a detailed discussion). A particular application area for the Kalman filter is signal estimation as part of time series analysis. Apache Spark provides a great framework to facilitate time series stream processing. As such, it would be useful to discuss how the Kalman filter can be combined with Apache Spark.
In this article, we will implement a Kalman filter for a simple dynamic model using the Apache Spark Structured Streaming engine and an Apache Kafka data source. We will use Apache Spark version 2.3.1 (latest, as of writing this article), Java version 1.8, and Kafka version 2.0.0. The article is organized as follows: the next section gives an overview of the dynamic model and the corresponding Kalman filter; the following section will discuss the application architecture and the corresponding deployment model, and in that section we will also review the Java code comprising different modules of the application; then, we will show graphically how the Kalman filter performs by comparing the predicted variables to measured variables under random measurement noise; we’ll wrap up the article by giving concluding remarks.
This is going on my “reread carefully” list; it’s very interesting and goes deep into the topic.
The first thing we will need to do is to establish the impression and conversion data streams. The impression data stream provides us a real-time view of the attributes associated with those customers who were served the digital ad (impression) while the conversion stream denotes customers who have performed an action (e.g. click the ad, purchased an item, etc.) based on that ad.
With Structured Streaming in Databricks, you can quickly plug into the stream as Databricks supports direct connectivity to Kafka (Apache Kafka, Apache Kafka on AWS, Apache Kafka on HDInsight) and Kinesis as noted in the following code snippet (this is for impressions, repeat this step for conversions)
This is definitely an interesting approach to the problem. Check it out.
The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context due to scalability reasons.
If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.
Quick reminder: In Spark, just like Hive, partitioning works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.
The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?
Read the whole thing.
A great benefit of Confluent MQTT Proxy is simplicity for realizing IoT scenarios without the need for a MQTT Broker. You can forward messages directly from the MQTT devices to Kafka via the MQTT Proxy. This reduces efforts and costs significantly. This is a perfect solution if you “just” want to communicate between Kafka and MQTT devices.
If you want to see the other part of the story (integration with sink applications like Elasticsearch / Grafana), please take a look at the Github project “KSQL for streaming IoT data“. This realizes the integration with ElasticSearch and Grafana via Kafka Connect and the Elastic connector.
Check it out and then take a gander at Kai’s GitHub repo.
The joining event streams pattern takes multiple data pipelines and joins them to produce a new signal message that can be acted upon by a later process.
This pattern can is used in a variety of use cases. Here are a few examples:
Merging data for an individual across a variety of social media accounts.
Merging click data from a variety of devices (e.g. mobile and desktop) for an individual user.
Tracking locations of delivery vehicles and assets that need to be delivered.
Monitoring electronic trading activity for clients on a variety of trading venues.
Conceptually, it’s very similar to normal join operations, but there is a time element which complicates things.
With Confluent Platform 5.0, operators can secure infrastructure using the new, easy-to-use LDAP authorizer plugin and can deliver faster disaster recovery (DR) thanks to automatic offset translation in Confluent Replicator. In Confluent Control Center, operators can now view broker configurations and inspect consumer lag to ensure that they are getting the most out of Kafka and that applications are performing as expected.
We have also introduced advanced capabilities for developers. In Confluent Control Center, developers can now better understand the data in Kafka topics due to the new topic inspection feature and Confluent Schema Registry integration. Control Center presents a new graphical user interface (GUI) for writing KSQL, making stream processing more effortless and intuitive as well. The latest version of KSQL itself introduces exciting additions, such as support for nested data, user-defined functions (UDFs), new types of joins and an enhanced REST API. Furthermore, Confluent Platform 5.0 includes the new Confluent MQTT Proxy for easier Internet of Things (IoT) integration with Kafka. The latest release is built on Apache Kafka 2.0, which features several new functionalities and performance improvements.
Looks like there have been some nice incremental improvements here.
So I designed my MiniFi flow in the Apache NiFi UI (pretty soon there will be a special designer for this). You then highlight everything there and hit ‘Create Template.’ You can then export it and convert it to
config.yml. Again, this process will be automated and connected with the NiFi Registry very shortly to reduce the amount of clicking.
This is an example. When you connect to it in your flow you design it in Apache NiFi UI, you will connect to this port on the Remote Processor Group. If you are manually editing one (okay never do this, but sometimes I have to), you can copy that ID from this Port Details and past it in the file.
I like this as an overview of NiFi’s capabilities and a sneak peek at where they’re going.
In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.
I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams. If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.
The need for visualizing the real-time data (or near-real time) has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities to visualize streaming data and this time, I will tackle this issue using Python. And python Dash package for building web applications and visualizations. Dash is build on top of the Flask, React and Plotly and give the wide range of capabilities to create a interactive web applications, interfaces and visualizations.
Tomaz’s example hit SQL Server every half-second to grab the latest changes and gives us an example of roll-your-own streaming.