Streaming – Page 18 – Curated SQL

Streaming ETL of Rail Data with Kafka

Published 2019-10-18 by Kevin Feasel

Robin Moffatt has an interesting architecture and implementation for Kafka:

Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka^® and Confluent Platform can provide the foundations for both event-driven applications as well as an analytical platform. With tools like KSQL and Kafka Connect, the concept of streaming ETL is made accessible to a much wider audience of developers and data engineers. The platform shown in this article is built using just SQL and JSON configuration files—not a scrap of Java code in sight.

The code is also available in a GitHub repo.

Comments closed

From Kafka to Pulsar

Published 2019-10-07 by Kevin Feasel

Avaro Santos Andres has arguments for migrating from Apache Kafka to Apache Pulsar:

Imagine you have thousands or millions of devices sending data to your data lake. This data must be managed with speed, security, and reliability. In addition, for legal reasons you must partition data by country, device, and city. These requirements seem reasonable, and in 2019, stream-processing platforms must be able to deal with them.
But how well do they? Kafka is not known to work well when there are thousands of topics and partitions even if the data is not massive. You can see how complicated it can be to try to solve performance challenges in these scenarios.

I like this sort of competition, as I know Kafka will step up their game as a result.

Comments closed

Spark Streaming DStreams

Published 2019-09-19 by Kevin Feasel

Manish Mishra explains the fundamental abstraction of Spark Streaming:

Before going into details of the operations available on the DStream API, let us look at the input sources from which we can start a Stream. There are multiple ways in which we can get the inputs from e.g. Kafka, Flume, etc. Or simple Idle files. To get the details on the available input sources supported by Spark, you can refer to this section. As part of this blog, we will take the example of Kafka.

Read on to see an example of pulling data from Kafka and converting inputs into microbatches.

Comments closed

Flink’s State Processor API

Published 2019-09-16 by Kevin Feasel

Seth Wiesman and Fabian Hueske show off Apache Flink’s State Processor API:

The State Processor API that comes with Flink 1.9 is a true game-changer in how you can work with application state! In a nutshell, it extends the DataSet API with Input and OutputFormats to read and write savepoint or checkpoint data. Due to the interoperability of DataSet and Table API, you can even use relational Table API or SQL queries to analyze and process state data.
For example, you can take a savepoint of a running stream processing application and analyze it with a DataSet batch program to verify that the application behaves correctly. Or you can read a batch of data from any store, preprocess it, and write the result to a savepoint that you use to bootstrap the state of a streaming application. It’s also possible to fix inconsistent state entries now. Finally, the State Processor API opens up many ways to evolve a stateful application that were previously blocked by parameter and design choices that could not be changed without losing all the state of the application after it was started. For example, you can now arbitrarily modify the data types of states, adjust the maximum parallelism of operators, split or merge operator state, re-assign operator UIDs, and so on

Read on to learn more about how this works.

Comments closed

Derivative Event Sourcing

Published 2019-09-13 by Kevin Feasel

Anna McDonald explains the concept of derivative event sourcing:

If you happen to be the proud owner of a single order service, then you are all set to begin.
But what if you have more than one order service?
Something that tends to happen at companies that have been around for more than a sprint is the accumulation of technical debt. Sometimes that debt takes the form of duplicate applications. Mergers happen and you adopt other applications that, for reasons beyond your control, cannot be retired or rewritten right away. In other words, sometimes you end up with more than one order service—enter derivative event sourcing!

This is a nice article for real-life scenarios where you don’t get to build nice, well-designed services from scratch.

Comments closed

Real-Time Analytics with Divolte, Kafka, Druid, and Superset

Published 2019-08-30 by Kevin Feasel

Fokko Driesprong gives us a proof of concept architecture for real-time analytics in the Hadoop ecosystem:

Divolte Collector is a scalable and performant application for collecting clickstream data and publishing it to a sink, such as Kafka, HDFS or S3. Divolte has been developed by GoDataDriven and made available to the public under the Apache 2.0 open source license.
Divolte can be used as the foundation to build anything from basic web analytics dashboarding to real-time recommender engines or banner optimization systems. By using a JavaScript tag in the browser of the customers, it gathers data about their behaviour on the website or application. You’re in full control what you do, and don’t want to capture.

Click through for the example.

Comments closed

Apache Flink 1.9 Released

Published 2019-08-27 by Kevin Feasel

The Apache Flink crew announces version 1.9.0:

The Apache Flink project’s goal is to develop a stream processing system to unify and power many forms of real-time and offline data processing applications as well as event-driven applications. In this release, we have made a huge step forward in that effort, by integrating Flink’s stream and batch processing capabilities under a single, unified runtime.
Significant features on this path are batch-style recovery for batch jobs and a preview of the new Blink-based query engine for Table API and SQL queries. We are also excited to announce the availability of the State Processor API, which is one of the most frequently requested features and enables users to read and write savepoints with Flink DataSet jobs. Finally, Flink 1.9 includes a reworked WebUI and previews of Flink’s new Python Table API and its integration with the Apache Hive ecosystem.

Click through for the major changes.

Comments closed

The Basics of Apache Airflow

Published 2019-08-26 by Kevin Feasel

Divyansh Jain explains what Apache Airflow is and takes us through a sample solution:

Airflow is a platform to programmatically author, schedule & monitor workflows or data pipelines. These functions achieved with Directed Acyclic Graphs (DAG) of the tasks. It is an open-source and still in the incubator stage. It was initialized in 2014 under the umbrella of Airbnb since then it got an excellent reputation with approximately 800 contributors on GitHub and 13000 stars. The main functions of Apache Airflow is to schedule workflow, monitor and author.

It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.

Comments closed

Cloudera Stream Processing

Published 2019-08-19 by Kevin Feasel

Dinesh Chandrasekhar announces the new iteration of Cloudera’s streaming data processor:

Cloudera Stream Processing (CSP) is a product within the Cloudera DataFlow platform that packs Kafka along with some key streaming components that empower enterprises to handle some of the most complex and sophisticated streaming use cases. CSP provides advanced messaging, real-time processing and analytics on real-time streaming data using Apache Kafka. CSP also supports key management and monitoring capabilities powered by Cloudera Streams Management (CSM).

Sounds like they’re taking the Kafka route over Spark Streaming, Flink, Airflow, etc.

Comments closed

Kafka 2.3 and Kafka Connect Improvements

Published 2019-08-16 by Kevin Feasel

Robin Moffatt goes over improvements in Kafka Connect with the release of Apache Kafka 2.3:

A Kafka Connect cluster is made up of one or more worker processes, and the cluster distributes the work of connectors as tasks. When a connector or worker is added or removed, Kafka Connect will attempt to rebalance these tasks. Before version 2.3 of Kafka, the cluster stopped all tasks, recomputed where to run all tasks, and then started everything again. Each rebalance halted all ingest and egress work for usually short periods of time, but also sometimes for a not insignificant duration of time.
Now with KIP-415, Apache Kafka 2.3 instead uses incremental cooperative rebalancing, which rebalances only those tasks that need to be started, stopped, or moved. For more details, there are available resources that you can read, listen, and watch, or you can hear the lead engineer on the work, Konstantine Karantasis, talk about it in person at the upcoming Kafka Summit.

Looks like some nice improvements here.

Comments closed

Category: Streaming