Press "Enter" to skip to content

Category: Streaming

Flink in Cloudera Streaming Analytics

Dinesh Chandrasekhar announces support for Apache Flink in Cloudera Streaming Analytics:

We cannot hold our excitement anymore! For the last few months, our Data-in-Motion engineering teams have been working hard to deliver a compelling and critical part of our Cloudera DataFlow (CDF) story. To enhance our Stream Processing and Analytics narrative within the overall Data-in-Motion platform, we give you support for Apache Flink with the general availability of Cloudera Streaming Analytics (CSA).

Cloudera Streaming Analytics, powered by Apache Flink, is a new product offering within the Cloudera DataFlow (CDF) platform that provides real-time stateful processing of IoT-scale data streams and complex events for predictive insights. Cloudera DataFlow is a comprehensive edge-to-cloud real-time streaming data platform. As one of the key pillars of CDF, stream processing & analytics is important for processing millions of data points and complex events coming from various streaming sources. Over the years, we have supported several streaming engines, but the addition of Flink now makes CDF an extremely compelling platform for processing high volumes of streaming data at scale.

This adds support for Flink; it looks like Spark Streaming and Kafka Streams are also supported, though Cloudera is pushing Flink as the first option rather than one among equals.
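
To make "real-time stateful processing" concrete, here is a minimal Flink DataStream sketch of the kind of job such a platform runs: a running count per device, held in Flink's managed keyed state. The elements, names, and job title are placeholders of mine, not anything from Cloudera's product.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DeviceEventCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; a real IoT deployment would read from Kafka, MQTT, etc.
        env.fromElements(
                Tuple2.of("device-1", 1L),
                Tuple2.of("device-2", 1L),
                Tuple2.of("device-1", 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           // keyBy + sum keeps a running count per device in Flink's managed keyed state.
           .keyBy(0)
           .sum(1)
           .print();

        env.execute("per-device-running-count");
    }
}
```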


Using ksqlDB to Read Twitter Data

Robin Moffatt has a quick demonstration of ksqlDB:

I’m going to show you how to use ksqlDB to do the following:

– Configure the live ingest of a stream of data from an external source (in this case, Twitter)
– Filter the stream for certain columns
– Create a new stream populated only by messages that match a given predicate
– Build aggregate materialised views, and use pull queries to directly fetch the state from these

Let’s dive in! As always, you’ll find the full test rig for trying this out yourself on GitHub.
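
As a rough sketch of those four steps driven from Java (using the ksqlDB Java client, which arrived in later releases; the stream names, columns, and abbreviated connector config below are illustrative, not Robin's):

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class TwitterKsqlSketch {
    public static void main(String[] args) throws Exception {
        Client client = Client.create(
            ClientOptions.create().setHost("localhost").setPort(8088));

        // 1. Live ingest from Twitter via a source connector (config abbreviated;
        //    a real one needs Twitter API credentials).
        client.executeStatement(
            "CREATE SOURCE CONNECTOR twitter WITH ("
            + " 'connector.class' = 'com.github.jcustenborder.kafka.connect.twitter.TwitterSourceConnector',"
            + " 'kafka.status.topic' = 'tweets');").get();

        // 2. A stream over the topic, keeping only the columns we care about.
        client.executeStatement(
            "CREATE STREAM tweets (username VARCHAR, text VARCHAR, lang VARCHAR)"
            + " WITH (KAFKA_TOPIC = 'tweets', VALUE_FORMAT = 'AVRO');").get();

        // 3. A new stream populated only by messages matching a predicate.
        client.executeStatement(
            "CREATE STREAM tweets_en AS SELECT * FROM tweets WHERE lang = 'en';").get();

        // 4. An aggregate materialised view, then a pull query to fetch its state directly.
        client.executeStatement(
            "CREATE TABLE tweets_per_user AS SELECT username, COUNT(*) AS tweet_count"
            + " FROM tweets_en GROUP BY username;").get();
        System.out.println(
            client.executeQuery("SELECT * FROM tweets_per_user WHERE username = 'rmoff';").get());

        client.close();
    }
}
```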


Testing Kafka Streams

Jukka Karvanen takes us through TopologyTestDriver in Apache Kafka Streams:

As with testing a single record, it is possible to pipe lists of values into a TestInputTopic. Validating the output can be done record by record as before, or by using the readValuesToList() method to see the big picture and validate the whole collection at once. For our example topology, when the test pipes in the values, it needs to validate the keys as well, using the readKeyValuesToList() method.

Testing streams of data (regardless of the product) is enough of a mental shift that it’s not an easy problem. For that reason, I welcome any tool which simplifies the process.
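
To show what that looks like end to end, here is a self-contained sketch testing a trivial uppercase topology with TopologyTestDriver; the topology and topic names are mine, not the article's:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseTopologyTest {
    public static void main(String[] args) {
        // A trivial topology: read "input", uppercase each value, write "output".
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> v.toUpperCase())
               .to("output", Produced.with(Serdes.String(), Serdes.String()));
        Topology topology = builder.build();

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

        try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
            TestInputTopic<String, String> input = driver.createInputTopic(
                "input", new StringSerializer(), new StringSerializer());
            TestOutputTopic<String, String> output = driver.createOutputTopic(
                "output", new StringDeserializer(), new StringDeserializer());

            // Pipe a whole list of values in, then validate the whole collection at once.
            input.pipeValueList(List.of("hello", "world"));
            System.out.println(output.readValuesToList()); // [HELLO, WORLD]
        }
    }
}
```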


Flink 1.8.3 Released

Hequn Cheng announces Flink 1.8.3:

The Apache Flink community released the third bugfix version of the Apache Flink 1.8 series.

This release includes 45 fixes and minor improvements for Flink 1.8.2. The list below details all fixes and improvements.

We highly recommend that all users upgrade to Flink 1.8.3.

There’s a nice list of bugfixes in the update.


Querying Pulsar Streams with Apache Flink

Sijie Guo and Markos Sfikas show how we can interact with Apache Pulsar using Apache Flink:

The latest integration between Flink 1.9.0 and Pulsar addresses most of the previously mentioned shortcomings. The contribution of Alibaba’s Blink to the Flink repository adds many enhancements and new features to the processing framework that make the integration with Pulsar significantly more powerful and impactful. Flink 1.9.0 brings Pulsar schema integration into the picture, makes the Table API a first-class citizen and provides an exactly-once streaming source and at-least-once streaming sink with Pulsar. Lastly, with schema integration, Pulsar can now be registered as a Flink catalog, making running Flink queries on top of Pulsar streams a matter of a few commands. In the following sections, we will take a closer look at the new integrations and provide examples of how to query Pulsar streams using Flink SQL.

Read on to see this integration in action.
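
As a taste of the SQL side, here is a minimal sketch, assuming the pulsar-flink connector has registered Pulsar as a Flink catalog named "pulsar" and that a topic sensor-events with device and temperature columns exists (both assumptions of mine):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PulsarSqlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Assumes the pulsar-flink connector has registered a catalog named
        // "pulsar", under which each namespace is a database and each topic a table.
        tEnv.useCatalog("pulsar");
        tEnv.useDatabase("public/default");

        // Hypothetical topic "sensor-events" with device and temperature columns.
        Table avgTemp = tEnv.sqlQuery(
            "SELECT device, AVG(temperature) AS avg_temp "
            + "FROM `sensor-events` GROUP BY device");

        // Continuous (retract) results printed to stdout.
        tEnv.toRetractStream(avgTemp, Row.class).print();
        env.execute("pulsar-sql-sketch");
    }
}
```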


KSQL to ksqlDB

Jay Kreps announces a new naming for KSQL:

Today marks a new release of KSQL, one so significant that we’re giving it a new name: ksqlDB. Like KSQL, ksqlDB remains freely available and community licensed, and you can get the code directly on GitHub. I’ll first share about what we’ve added in this release, then talk about why I think it is so important and explain the new naming.

There are two new major features we’re adding: pull queries and connector management.

This looks really interesting.


Streaming ETL of Rail Data with Kafka

Robin Moffatt has an interesting architecture and implementation for Kafka:

Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka® and Confluent Platform can provide the foundations for both event-driven applications as well as an analytical platform. With tools like KSQL and Kafka Connect, the concept of streaming ETL is made accessible to a much wider audience of developers and data engineers. The platform shown in this article is built using just SQL and JSON configuration files—not a scrap of Java code in sight.

The code is also available in a GitHub repo.
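
The "JSON configuration files" are applied by POSTing them to Kafka Connect's REST API. Here is a hedged sketch, with an abbreviated ActiveMQ source config in the spirit of the article's rail feed (connector name, topic, and URL are illustrative; a real config also needs credentials and a JMS destination):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterRailConnector {
    public static void main(String[] args) throws Exception {
        // Abbreviated connector config; a real one also needs ActiveMQ credentials
        // and the JMS destination carrying the train-movement feed.
        String config =
            "{\"name\": \"train-movements-source\","
            + " \"config\": {"
            + "   \"connector.class\": \"io.confluent.connect.activemq.ActiveMQSourceConnector\","
            + "   \"activemq.url\": \"tcp://datafeeds.networkrail.co.uk:61619\","
            + "   \"kafka.topic\": \"networkrail_TRAIN_MVT\"}}";

        // Kafka Connect accepts new connectors as JSON on its REST API.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(config))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```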


From Kafka to Pulsar

Alvaro Santos Andres has arguments for migrating from Apache Kafka to Apache Pulsar:

Imagine you have thousands or millions of devices sending data to your data lake. This data must be managed with speed, security, and reliability. In addition, for legal reasons you must partition data by country, device, and city. These requirements seem reasonable, and in 2019, stream-processing platforms must be able to deal with them.

But how well do they? Kafka is not known to work well when there are thousands of topics and partitions even if the data is not massive. You can see how complicated it can be to try to solve performance challenges in these scenarios.

I like this sort of competition, as I know Kafka will step up their game as a result.
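
One concrete way the difference shows up is in Pulsar's consumer API, where a single subscription can span an arbitrary number of topics matched by regex; the broker URL and topic naming scheme here are invented for illustration:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class ManyTopicsConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
            .serviceUrl("pulsar://localhost:6650")
            .build();

        // One subscription spanning every per-country/per-city device topic,
        // e.g. persistent://public/default/device-es-madrid-0042
        Consumer<byte[]> consumer = client.newConsumer()
            .topicsPattern("persistent://public/default/device-.*")
            .subscriptionName("lake-ingest")
            .subscribe();

        Message<byte[]> msg = consumer.receive();
        consumer.acknowledge(msg);
        client.close();
    }
}
```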


Spark Streaming DStreams

Manish Mishra explains the fundamental abstraction of Spark Streaming:

Before going into the details of the operations available on the DStream API, let us look at the input sources from which we can start a stream. There are multiple sources from which we can get the input, e.g. Kafka, Flume, or even simple idle files. To get the details on the available input sources supported by Spark, you can refer to this section. As part of this blog, we will take the example of Kafka.

Read on to see an example of pulling data from Kafka and converting inputs into microbatches.
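
For reference, a minimal Java version of that Kafka example using the spark-streaming-kafka-0-10 direct stream (broker address, topic, and group id are placeholders):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaDStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-dstream").setMaster("local[2]");
        // Each 5-second microbatch becomes one RDD in the DStream.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "dstream-example");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("events"), kafkaParams));

        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```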


Flink’s State Processor API

Seth Wiesman and Fabian Hueske show off Apache Flink’s State Processor API:

The State Processor API that comes with Flink 1.9 is a true game-changer in how you can work with application state! In a nutshell, it extends the DataSet API with Input and OutputFormats to read and write savepoint or checkpoint data. Due to the interoperability of DataSet and Table API, you can even use relational Table API or SQL queries to analyze and process state data.

For example, you can take a savepoint of a running stream processing application and analyze it with a DataSet batch program to verify that the application behaves correctly. Or you can read a batch of data from any store, preprocess it, and write the result to a savepoint that you use to bootstrap the state of a streaming application. It’s also possible to fix inconsistent state entries now. Finally, the State Processor API opens up many ways to evolve a stateful application that were previously blocked by parameter and design choices that could not be changed without losing all the state of the application after it was started. For example, you can now arbitrarily modify the data types of states, adjust the maximum parallelism of operators, split or merge operator state, re-assign operator UIDs, and so on.

Read on to learn more about how this works.
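
A small sketch of the read side: loading a savepoint in a batch program and pulling one operator's list state out as a DataSet. The savepoint path, operator UID, and state name are invented for illustration.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;

public class ReadSavepointSketch {
    public static void main(String[] args) throws Exception {
        // The State Processor API reads savepoints with a batch (DataSet) program.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ExistingSavepoint savepoint = Savepoint.load(
            env, "file:///tmp/savepoints/savepoint-abc123", new MemoryStateBackend());

        // Pull an operator's list state out as an ordinary DataSet and inspect it.
        DataSet<Long> counters =
            savepoint.readListState("counter-uid", "counter-state", Types.LONG);
        counters.print();
    }
}
```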
