Streaming – Page 17 – Curated SQL

The latest integration between Flink 1.9.0 and Pulsar addresses most of the previously mentioned shortcomings. The contribution of Alibaba’s Blink to the Flink repository adds many enhancements and new features to the processing framework that make the integration with Pulsar significantly more powerful and impactful. Flink 1.9.0 brings Pulsar schema integration into the picture, makes the Table API a first-class citizen and provides an exactly-once streaming source and at-least-once streaming sink with Pulsar. Lastly, with schema integration, Pulsar can now be registered as a Flink catalog, making running Flink queries on top of Pulsar streams a matter of a few commands. In the following sections, we will take a closer look at the new integrations and provide examples of how to query Pulsar streams using Flink SQL.

Read on to see this integration in action.

Comments closed

KSQL to ksqlDB

Published 2019-11-27 by Kevin Feasel

Jay Kreps announces a new naming for KSQL:

Today marks a new release of KSQL, one so significant that we’re giving it a new name: ksqlDB. Like KSQL, ksqlDB remains freely available and community licensed, and you can get the code directly on GitHub. I’ll first share about what we’ve added in this release, then talk about why I think it is so important and explain the new naming.
There are two new major features we’re adding: pull queries and connector management.

This looks really interesting.

Comments closed

Streaming ETL of Rail Data with Kafka

Published 2019-10-18 by Kevin Feasel

Robin Moffatt has an interesting architecture and implementation for Kafka:

Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka^® and Confluent Platform can provide the foundations for both event-driven applications as well as an analytical platform. With tools like KSQL and Kafka Connect, the concept of streaming ETL is made accessible to a much wider audience of developers and data engineers. The platform shown in this article is built using just SQL and JSON configuration files—not a scrap of Java code in sight.

The code is also available in a GitHub repo.

Comments closed

From Kafka to Pulsar

Published 2019-10-07 by Kevin Feasel

Avaro Santos Andres has arguments for migrating from Apache Kafka to Apache Pulsar:

Imagine you have thousands or millions of devices sending data to your data lake. This data must be managed with speed, security, and reliability. In addition, for legal reasons you must partition data by country, device, and city. These requirements seem reasonable, and in 2019, stream-processing platforms must be able to deal with them.
But how well do they? Kafka is not known to work well when there are thousands of topics and partitions even if the data is not massive. You can see how complicated it can be to try to solve performance challenges in these scenarios.

I like this sort of competition, as I know Kafka will step up their game as a result.

Comments closed

Spark Streaming DStreams

Published 2019-09-19 by Kevin Feasel

Manish Mishra explains the fundamental abstraction of Spark Streaming:

Before going into details of the operations available on the DStream API, let us look at the input sources from which we can start a Stream. There are multiple ways in which we can get the inputs from e.g. Kafka, Flume, etc. Or simple Idle files. To get the details on the available input sources supported by Spark, you can refer to this section. As part of this blog, we will take the example of Kafka.

Read on to see an example of pulling data from Kafka and converting inputs into microbatches.

Comments closed

Flink’s State Processor API

Published 2019-09-16 by Kevin Feasel

Seth Wiesman and Fabian Hueske show off Apache Flink’s State Processor API:

The State Processor API that comes with Flink 1.9 is a true game-changer in how you can work with application state! In a nutshell, it extends the DataSet API with Input and OutputFormats to read and write savepoint or checkpoint data. Due to the interoperability of DataSet and Table API, you can even use relational Table API or SQL queries to analyze and process state data.
For example, you can take a savepoint of a running stream processing application and analyze it with a DataSet batch program to verify that the application behaves correctly. Or you can read a batch of data from any store, preprocess it, and write the result to a savepoint that you use to bootstrap the state of a streaming application. It’s also possible to fix inconsistent state entries now. Finally, the State Processor API opens up many ways to evolve a stateful application that were previously blocked by parameter and design choices that could not be changed without losing all the state of the application after it was started. For example, you can now arbitrarily modify the data types of states, adjust the maximum parallelism of operators, split or merge operator state, re-assign operator UIDs, and so on

Read on to learn more about how this works.

Comments closed

Derivative Event Sourcing

Published 2019-09-13 by Kevin Feasel

Anna McDonald explains the concept of derivative event sourcing:

If you happen to be the proud owner of a single order service, then you are all set to begin.
But what if you have more than one order service?
Something that tends to happen at companies that have been around for more than a sprint is the accumulation of technical debt. Sometimes that debt takes the form of duplicate applications. Mergers happen and you adopt other applications that, for reasons beyond your control, cannot be retired or rewritten right away. In other words, sometimes you end up with more than one order service—enter derivative event sourcing!

This is a nice article for real-life scenarios where you don’t get to build nice, well-designed services from scratch.

Comments closed

Real-Time Analytics with Divolte, Kafka, Druid, and Superset

Published 2019-08-30 by Kevin Feasel

Fokko Driesprong gives us a proof of concept architecture for real-time analytics in the Hadoop ecosystem:

Divolte Collector is a scalable and performant application for collecting clickstream data and publishing it to a sink, such as Kafka, HDFS or S3. Divolte has been developed by GoDataDriven and made available to the public under the Apache 2.0 open source license.
Divolte can be used as the foundation to build anything from basic web analytics dashboarding to real-time recommender engines or banner optimization systems. By using a JavaScript tag in the browser of the customers, it gathers data about their behaviour on the website or application. You’re in full control what you do, and don’t want to capture.

Click through for the example.

Comments closed

Apache Flink 1.9 Released

Published 2019-08-27 by Kevin Feasel

The Apache Flink crew announces version 1.9.0:

The Apache Flink project’s goal is to develop a stream processing system to unify and power many forms of real-time and offline data processing applications as well as event-driven applications. In this release, we have made a huge step forward in that effort, by integrating Flink’s stream and batch processing capabilities under a single, unified runtime.
Significant features on this path are batch-style recovery for batch jobs and a preview of the new Blink-based query engine for Table API and SQL queries. We are also excited to announce the availability of the State Processor API, which is one of the most frequently requested features and enables users to read and write savepoints with Flink DataSet jobs. Finally, Flink 1.9 includes a reworked WebUI and previews of Flink’s new Python Table API and its integration with the Apache Hive ecosystem.

Click through for the major changes.

Comments closed

The Basics of Apache Airflow

Published 2019-08-26 by Kevin Feasel

Divyansh Jain explains what Apache Airflow is and takes us through a sample solution:

Airflow is a platform to programmatically author, schedule & monitor workflows or data pipelines. These functions achieved with Directed Acyclic Graphs (DAG) of the tasks. It is an open-source and still in the incubator stage. It was initialized in 2014 under the umbrella of Airbnb since then it got an excellent reputation with approximately 800 contributors on GitHub and 13000 stars. The main functions of Apache Airflow is to schedule workflow, monitor and author.

It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.

Comments closed

Category: Streaming

Querying Pulsar Streams with Apache Flink

KSQL to ksqlDB

Streaming ETL of Rail Data with Kafka

From Kafka to Pulsar

Spark Streaming DStreams

Flink’s State Processor API

Derivative Event Sourcing

Real-Time Analytics with Divolte, Kafka, Druid, and Superset

Apache Flink 1.9 Released

The Basics of Apache Airflow