Category: Streaming

Building a Stream Processing App with ksql

The Hadoop in Real World team walks us through event streaming with ksql:

ksqlDB is an event streaming database that enables building powerful stream processing applications on top of Apache Kafka using familiar SQL syntax, referred to as KSQL. This is a powerful concept that abstracts away much of the complexity of stream processing from the user. Business users and analysts with a SQL background can query the complex data structures passing through Kafka and get real-time insights. In this article, we are going to see how to set up ksqlDB and also look at important concepts in KSQL and its usage.
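
To give a flavor of the syntax, here's roughly what a continuous query over a Kafka topic looks like in KSQL. This is a minimal sketch; the pageviews stream and its columns are hypothetical stand-ins, not taken from the article:

```sql
-- Expose an existing Kafka topic as a queryable stream
-- (topic name and columns are hypothetical)
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- A push query: continuously emit per-page counts over one-minute windows
SELECT page, COUNT(*) AS views
FROM pageviews
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY page
EMIT CHANGES;
```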

Event streaming has become a lot easier over the past couple of years, as Kafka, Spark, and Flink have all matured.

Bot-Building with ksqlDB

Robin Moffatt has an interesting project for us:

But what if you didn’t need any datastore other than Kafka itself? What if you could ingest, filter, enrich, aggregate, and query data with just Kafka? With ksqlDB we can do just this, and I want to show you exactly how.

We’re going to build a simple system that captures Wi-Fi packets, processes them, and serves up on-demand information about the devices connecting to Wi-Fi. The “secret sauce” here is ksqlDB’s ability to build stateful aggregates that can be directly accessed using pull queries. This is going to power a very simple bot for the messaging platform Telegram, which takes a unique device name as input and returns statistics about its Wi-Fi probe activities to the user.
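
The pattern at the core is worth sketching on its own: materialize a stateful aggregate as a table with a continuous query, then read the current state on demand with a pull query. A minimal sketch, with hypothetical stream and column names rather than the ones from Robin's project:

```sql
-- Continuously aggregate probe events into per-device state
CREATE TABLE device_stats AS
  SELECT device_name, COUNT(*) AS probe_count
  FROM wifi_probes
  GROUP BY device_name
  EMIT CHANGES;

-- A pull query: fetch the current value for one key and return immediately
-- (older ksqlDB versions require the lookup to be on ROWKEY)
SELECT device_name, probe_count
FROM device_stats
WHERE device_name = 'some-device';
```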

Click through for the tutorial.

Project Metamorphosis: Elastic Kafka Clusters

Jay Kreps explains what Confluent has been up to lately:

What is Project Metamorphosis?

Let me try to explain. I think there are two big shifts happening in the world of data right now, and Project Metamorphosis is an attempt to bring those two things together.

The first one, and the one that Confluent is known for, is the move to event streaming.

Event streams are a real revolution in how we think about and use data, and we think they are going to be at the core of one of the most important data platforms in a modern company. Our goal at Confluent is to build the infrastructure that makes that possible and help the world take advantage of it. That’s why we exist.

But event streaming isn’t the only paradigm shift we’re in the midst of. The other change comes from the movement to the cloud.

Click through for the high-level overview. I can see this competing even more directly with Kinesis and Event Hubs.

Technology Choices for Streaming Pipelines

The Hadoop in Real World team takes us through different tools available when working on streaming pipelines:

Businesses want insights as quickly as possible; they no longer want to wait until the next day for a report on what happened through yesterday. They need a more proactive approach, one that lets them act immediately when something significant happens and prevent faults or downtime before they occur. Imagine you are buying a product from an e-retailer: you have gotten as far as making the payment, and something causes the payment to fail. At that moment, you are having second thoughts about whether to buy the product now or later. If the business only receives a report of this occurrence the next day, it is of little use, because the customer will already have bought the product elsewhere or decided against it. This is where real-time events and insights come in. With a real-time report, the team could call the customer right away and offer a discount, which might well change the customer's mind and save the sale.

Click through for a high-level discussion of these tools.

Memory Management in Flink 1.10

Andrey Zagrebin walks us through some memory management improvements in the most recent version of Apache Flink:

Apache Flink 1.10 comes with significant changes to the memory model of the Task Managers and the configuration options for your Flink applications. These recently introduced changes make Flink more adaptable to all kinds of deployment environments (e.g., Kubernetes, YARN, Mesos), providing strict control over its memory consumption. In this post, we describe Flink’s memory model as it stands in Flink 1.10, how to set up and manage memory consumption of your Flink applications, and the recent changes the community implemented in the latest Apache Flink release.
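
For a taste of the new options, here is a minimal sketch that pins the two headline knobs when running a job locally. The option keys come from the Flink 1.10 documentation; the values are illustrative only, and in a real cluster they would live in flink-conf.yaml:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MemoryConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Total memory for the TaskManager process (heap + off-heap + JVM overhead)
        conf.setString("taskmanager.memory.process.size", "2048m");
        // Fraction of Flink memory set aside as managed (off-heap) memory,
        // used by the RocksDB state backend and batch operators
        conf.setString("taskmanager.memory.managed.fraction", "0.4");

        // A local environment that honors the configuration above
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        // ... define and execute a job against env ...
    }
}
```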

Click through to learn about the current model and methods to control memory utilization.

Serialization in Apache Flink

Nico Kruber walks us through the viable set of serializers in Apache Flink:

Flink handles data types and serialization with its own type descriptors, generic type extraction, and type serialization framework. We recommend reading through the documentation first in order to be able to follow the arguments we present below. In essence, Flink tries to infer information about your job’s data types for wire and state serialization, and to be able to use grouping, joining, and aggregation operations by referring to individual field names, e.g. stream.keyBy("ruleId") or dataSet.join(another).where("name").equalTo("personName"). It also allows optimizations in the serialization format as well as reducing unnecessary de/serializations (mainly in certain Batch operations as well as in the SQL/Table APIs).
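
As a concrete illustration of the type-extraction point, a class only gets Flink's efficient PojoSerializer if it follows the POJO rules; otherwise Flink silently falls back to generic Kryo serialization. A minimal sketch (the Event class is hypothetical), including the switch that makes such fallback fail fast:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PojoSketch {
    // Follows Flink's POJO rules: public class, public no-arg constructor,
    // and fields that are public or exposed via getter/setter pairs.
    public static class Event {
        public String ruleId;
        public long timestamp;
        public Event() {}
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // Throw instead of silently using Kryo when type extraction fails
        env.getConfig().disableGenericTypes();
        // With a proper POJO, field-name operations like keyBy("ruleId") work
    }
}
```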

Click through for notes on each serializer and a graph which shows how the choice of a serializer can make a huge difference.

Stateful Functions in Apache Flink

Stephan Ewen announces Stateful Functions 2.0:

Today, we are announcing the release of Stateful Functions (StateFun) 2.0 — the first release of Stateful Functions as part of the Apache Flink project. This release marks a big milestone: Stateful Functions 2.0 is not only an API update, but the first version of an event-driven database that is built on Apache Flink.

Stateful Functions 2.0 makes it possible to combine StateFun’s powerful approach to state and composition with the elasticity, rapid scaling/scale-to-zero and rolling upgrade capabilities of FaaS implementations like AWS Lambda and modern resource orchestration frameworks like Kubernetes.

With these features, Stateful Functions 2.0 addresses two of the most cited shortcomings of many FaaS setups today: consistent state and efficient messaging between functions.
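
To make that concrete, a stateful function in the Java SDK looks roughly like this. This is a minimal sketch in the style of the project’s greeter example, not code from this announcement; I’m assuming the StateFun 2.0 Java SDK types here:

```java
import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.StatefulFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

public class GreeterFunction implements StatefulFunction {
    // Durable state, managed consistently by Flink, scoped per function instance
    @Persisted
    private final PersistedValue<Integer> seenCount =
        PersistedValue.of("seen-count", Integer.class);

    @Override
    public void invoke(Context context, Object input) {
        int seen = seenCount.getOrDefault(0) + 1;
        seenCount.set(seen);
        // Exactly-once messaging back to whichever function sent the input
        context.send(context.caller(), "Hello for the " + seen + "th time!");
    }
}
```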

Read on to see how it works.

Using Apache Flink to Read from Apache Kafka

Preetdeep Kumar crosses the streams:

Apache Flink provides various connectors to integrate with other systems. In this article, I will share an example of consuming records from Kafka through FlinkKafkaConsumer and producing records to Kafka using FlinkKafkaProducer.
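
The shape of such a job is simple enough to show in full. A minimal sketch (broker address, topic names, and group id are assumptions for illustration):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaPassthrough {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "flink-demo");              // assumed group id

        // Consume string records from one topic...
        DataStream<String> in = env.addSource(
            new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // ...apply a trivial transformation, and produce to another topic
        in.map(String::toUpperCase)
          .addSink(new FlinkKafkaProducer<>(
              "output-topic", new SimpleStringSchema(), props));

        env.execute("kafka-passthrough");
    }
}
```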

Read on for an example. I’m glad to see that integration between these two competitors (more precisely, it’s Flink and Kafka Streams that compete) is so easy.

The Flink-Hive Integration

Bowen Li takes us through Apache Flink 1.10’s integration with Apache Hive:

On the other hand, Apache Hive has established itself as a focal point of the data warehousing ecosystem. It serves not only as a SQL engine for big data analytics and ETL, but also as a data management platform where data is discovered and defined. As the business evolves, it puts new requirements on the data warehouse.

Thus we started integrating Flink and Hive as a beta version in Flink 1.9. Over the past few months, we have been listening to users’ requests and feedback, extensively enhancing our product, and running rigorous benchmarks (which will be published soon separately). I’m glad to announce that the integration between Flink and Hive is at production grade in Flink 1.10 and we can’t wait to walk you through the details.
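
The entry point for the integration is the HiveCatalog, which points Flink at an existing Hive Metastore. A minimal sketch (catalog name, conf directory, and Hive version are assumptions for illustration):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogSketch {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .useBlinkPlanner().inBatchMode().build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        // Point Flink at an existing Hive Metastore (values assumed)
        HiveCatalog hive = new HiveCatalog(
            "myhive",          // catalog name inside Flink
            "default",         // default Hive database
            "/opt/hive/conf",  // directory containing hive-site.xml
            "2.3.4");          // Hive version
        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // Hive tables are now visible to Flink SQL
        tableEnv.sqlQuery("SELECT * FROM some_hive_table");
    }
}
```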

Click through to see how it works.
