Press "Enter" to skip to content

Category: Streaming

Flink 1.15 Released

Joe Moser and Yun Gao announce Apache Flink 1.15:

Thanks to our well-organized and open community, Apache Flink continues to grow as a technology and remain one of the most active projects in the Apache community. With the release of Flink 1.15, we are proud to announce a number of exciting changes.

One of the main concepts that makes Apache Flink stand out is the unification of batch (aka bounded) and stream (aka unbounded) data processing, which helps reduce the complexity of development. A lot of effort went into this unification in the previous releases, and you can expect more efforts in this direction.

Apache Flink is not only growing when it comes to contributions and users, but also out of the original use cases. We are seeing a trend towards more business/analytics use cases implemented in low-/no-code. Flink SQL is the feature in the Flink ecosystem that enables such uses cases and this is why its popularity continues to grow.

Flink SQL is Feasel’s Law in action.

Comments closed

Streaming Data into Synapse Dedicated SQL Pool

Lionel Penuchot loads some data:

This article reviews a common pattern of streaming data (i.e. real-time message ingestion) in Synapse dedicated pool. It opens a discussion on the simple standard way to implement this, as well as the challenges and drawbacks. It then presents an alternate solution which enables optimal performance and greatly reduces maintenance tasks when using clustered column store indexes. This is aimed at developers, DBAs, architects, and anyone who works with streams of data that are captured in real-time.

I’d probably avoid the MERGE statement in there because of how many problems there are with it. That said, this is a useful pattern for trickle-loading columnstore tables.

Comments closed

From Confluent Cloud into Azure Synapse Analytics

Jacob Bogie and Dustin Vannoy show how to integrate Kafka in Confluent Cloud with pools in Azure Synapse Analytics:

Just released this fall, is the fully managed Synapse Connector. Azure Synapse Analytics provides a platform for data analysts and data scientists to analyze and combine data from multiple sources. Within Confluent Cloud, data can be synched to dedicated SQL pools via the fully managed Synapse sink connector and attached to Synapse Analytics workspace. Once added to the Synapse Analytics workspace, analysts have the ability to perform advanced analytics and reporting on data in the Confluent pipeline. The ability to access event-level data enables event-level analytics and data exploration.

Click through for two examples, one of loading data into a dedicated SQL pool and one of streaming data into Spark Streaming running on (naturally) a Spark pool.

Comments closed

Replacing Zookeeper in Kafka

Guozhang Wang explains the decision-making behind a major change in Apache Kafka:

Why replace ZooKeeper with an internal log for Apache Kafka® metadata management? This post explores the rationale behind the replacement, examines why a quorum-based consensus protocol like Raft was utilized and altered to become KRaft, and describes the new Quorum Controller built on top of KRaft protocols.

Click through for the reasoning, which includes a considerably faster shutdown in large environments..

Comments closed

Dynamic DAGs with Apache Airflow

Bhavya Garg explains how we can create dynamic directed acyclic graphs in Apache Airflow:

Airflow dynamic DAGs can save you a ton of time. As you know, Apache Airflow is written in Python, and DAGs are created via Python scripts. That makes it very flexible and powerful (even complex sometimes). By leveraging Python, you can create DAGs dynamically based on variables, connections, a typical pattern, etc. This very nice way of generating DAGs comes at the price of higher complexity and subtle tricky things that you must know

Read on for an example.

Comments closed

Making Kafka Clients Faster

Yeva Byzek has a few whitepaper recommendations:

Over the years, incredible technical content has been written about data plane performance, general principles and tradeoffs, cloud-native architectures, etc. These writings describe how you can get low latency and high throughput without compromising on a mature and reliable platform that provides persistence, no data loss, audit logs, processing logs, and more—all the things that enable you to go from proof of concept to production. This blog post highlights the top five reading recommendations to help you gain a deeper understanding of what makes applications that run on Confluent Cloud so fast. They cover the key concepts and provide concrete examples of how we do it, and how you can do it too, with specific benchmark testing and configuration guidelines.

Click through for links to those resources.

Comments closed

De-Scalafication in Flink

Seth Wiesman has a post leaving me feeling a little bittersweet:

If you have worked with a JVM-based application, you have probably heard the term classpath. The classpath defines where the JVM will search for a given classfile when it needs to be loaded. There may only be one instance of a classfile on each classpath, forcing any dependency Flink exposes onto users. That is why the Flink community works hard to keep our classpath “clean” – or free of unnecessary dependencies. We achieve this through a combination of shaded dependencieschild first class loading, and a plugins abstraction for optional components.

The Apache Flink runtime is primarily written in Java but contains critical components that forced Scala on the default classpath. And because Scala does not maintain binary compatibility across minor releases, this historically required cross-building components for all versions of Scala. But due to many reasons – breaking changes in the compilera new standard library, and a reworked macro system – this was easier said than done.

They did it, which means less Scala in the code base. But it also means that you aren’t tied to a particular version of Scala in your own code. I’m happy about it on the whole but it does expose a frustrating pain point with Scala.

Comments closed

Apache Flink and Delta Lake

Max Fisher and Dylan Gessner use Flink to load data in Delta Lake:

As with all parts of our platform, we are constantly raising the bar and adding new features to enhance developers’ abilities to build the applications that will make their Lakehouse a reality. Building real-time applications on Databricks is no exception. Features like asynchronous checkpointingsession windows, and Delta Live Tables allow organizations to build even more powerful, real-time pipelines on Databricks using Delta Lake as the foundation for all the data that flows through the Lakehouse.

However, for organizations that leverage Flink for real-time transformations, it might appear that they are unable to take advantage of some of the great Delta Lake and Databricks features, but that is not the case. In this blog we will explore how Flink developers can build pipelines to integrate their Flink applications into the broader Lakehouse architecture.

Click through for two methods of doing so.

Comments closed

Streaming Data to Event Hubs via Kafka Connect and Debezium

Niels Berglund starts off a two-part sub-series within a series:

This post is the first of two looking at if and how we can stream data to Event Hubs from Debezium. Initially I had planned only one post covering this, but it turned out that the post would be too long, so therefore I split it in two.

It started with the post, How to Use Kafka Client with Azure Event Hubs. In that post, I looked at how the Kafka client can publish messages to – not only – Apache Kafka but also Azure Event Hubs. In the post, I said something like:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

Click through for the first part of this pairing.

Comments closed

Apache Flink ML 2.0.0

Dong Lin and Yun Gao make an announcement:

The Apache Flink community is excited to announce the release of Flink ML 2.0.0! Flink ML is a library that provides APIs and infrastructure for building stream-batch unified machine learning algorithms, that can be easy-to-use and performant with (near-) real-time latency.

This release involves a major refactor of the earlier Flink ML library and introduces major features that extend the Flink ML API and the iteration runtime, such as supporting stages with multi-input multi-output, graph-based stage composition, and a new stream-batch unified iteration library. Moreover, we added five algorithm implementations in this release, which is the start of a long-term initiative to provide a large number of off-the-shelf algorithms in Flink ML with state-of-the-art performance.

Congratulations to everybody who contributed to the project; it’s a big milestone.

Comments closed