Category: Streaming

Loading Data into Delta Lake

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.
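
To give a sense of how little wiring that takes, here is a minimal sketch (Scala, notebook-style on Databricks) of a cloudFiles stream landing raw JSON in a Delta table. The paths and schema are hypothetical placeholders, not code from the post.

```scala
// Sketch only: paths and schema are hypothetical; assumes a Databricks runtime with Auto Loader.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

val rawEvents = spark.readStream
  .format("cloudFiles")                            // the Auto Loader source
  .option("cloudFiles.format", "json")             // format of the incoming raw files
  .option("cloudFiles.useNotifications", "true")   // subscribe to file events instead of listing the directory
  .schema("id LONG, payload STRING, event_time TIMESTAMP")
  .load("/mnt/raw/events")                         // source directory path

rawEvents.writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/delta/events/_checkpoint")
  .start("/mnt/delta/events")                      // Delta Lake destination
```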

This does look interesting.

Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

Steffen Hausmann shows us how to put together a streaming ETL pipeline in AWS using Apache Flink and Amazon Kinesis Data Analytics:

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.

This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.
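
For a rough idea of what that jar contains, here is a hedged sketch (Scala) of a Flink job consuming from a Kinesis data stream, the sort of application you would compile and hand to Kinesis Data Analytics. The stream name and region are made up, and the transformations and sinks from the post are omitted.

```scala
// Sketch only: stream name and region are hypothetical; destination sinks omitted.
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants

object StreamingEtlJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val consumerConfig = new Properties()
    consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "eu-west-1")

    // Read raw records from the input Kinesis data stream.
    val events: DataStream[String] = env.addSource(
      new FlinkKinesisConsumer[String]("streaming-etl-input", new SimpleStringSchema(), consumerConfig))

    // Parsing, enrichment, and the actual sinks would go here.
    events.print()

    env.execute("streaming-etl")
  }
}
```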

Click through for the walkthrough.

Creating Sources and Sinks with Blink

Seth Wiesman has a tutorial showing how to create sources and sinks using Apache Flink’s SQL interface with the Blink planner:

A lot of work focused on improving runtime performance and progressively extending its coverage of the SQL standard. Flink now supports the full TPC-DS query set for batch queries, reflecting the readiness of its SQL engine to address the needs of modern data warehouse-like workloads. Its streaming SQL supports an almost equal set of features – those that are well defined on a streaming runtime – including complex joins and MATCH_RECOGNIZE.

As important as this work is, the community also strives to make these features generally accessible to the broadest audience possible. That is why the Flink community is excited in 1.10 to offer production-ready DDL syntax (e.g., CREATE TABLE, DROP TABLE) and a refactored catalog interface.

Click through for a demonstration. One of the nicest things about the ANSI SQL standard is that it was intended to be a one-language solution, where the language used for administration is the same as the language used for regular queries. That cuts down on the number of languages you need to know to get your job done.
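
To illustrate that point, here is a hedged sketch of the Flink 1.10-era API, with a DDL-defined Kafka source created and queried through the same TableEnvironment. The table, topic, and connector properties are hypothetical, and the exact WITH options vary by Flink version.

```scala
// Sketch only: table name, topic, and connector properties are hypothetical,
// and the exact WITH options differ across Flink versions.
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = TableEnvironment.create(settings)

// DDL goes through the same entry point as queries.
tableEnv.sqlUpdate(
  """CREATE TABLE page_views (
    |  user_id STRING,
    |  url     STRING,
    |  ts      TIMESTAMP(3)
    |) WITH (
    |  'connector.type'    = 'kafka',
    |  'connector.version' = 'universal',
    |  'connector.topic'   = 'page-views',
    |  'connector.properties.bootstrap.servers' = 'localhost:9092',
    |  'format.type'       = 'json'
    |)""".stripMargin)

// ...and so does the query itself (DROP TABLE page_views works the same way).
val viewsPerUser = tableEnv.sqlQuery(
  "SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id")
```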

Flink 1.10.0 Released

Marta Paes announces the release of Apache Flink 1.10.0:

The Apache Flink community is excited to hit the double digits and announce the release of Flink 1.10.0! As a result of the biggest community effort to date, with over 1.2k issues implemented and more than 200 contributors, this release introduces significant improvements to the overall performance and stability of Flink jobs, a preview of native Kubernetes integration and great advances in Python support (PyFlink).

Flink 1.10 also marks the completion of the Blink integration, hardening streaming SQL and bringing mature batch processing to Flink with production-ready Hive integration and TPC-DS coverage. This blog post describes all major new features and improvements, important changes to be aware of and what to expect moving forward.

Read on for the improvements and let me once more point out the validation of Feasel’s Law.

Building a Cache in ksqlDB

Michael Drogalis shows how to use ksqlDB to build a materialized cache and take query load off of your database:

There are a lot of ways that you can introduce a materialized cache into your architecture. One such way is to leverage ksqlDB, an event streaming database purpose-built for stream processing applications. With native Kafka integration, ksqlDB makes it easy to replicate the pattern of scaling out many sets of distributed caches.

Let’s look at how this works in action with an example application. Imagine that you have a database storing geospatial data of pings from drivers at a ridesharing company. You have a particular piece of logic that you want to move out of the database—a frequently run query to aggregate how active a territory is. You can build a materialized cache for it using ksqlDB.
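
As a sketch of the reading side of that pattern, an application can issue a pull query against the materialized table over ksqlDB's REST API. Everything named below (server address, table, columns) is hypothetical rather than taken from the tutorial.

```scala
// Sketch only: server address, table, and column names are hypothetical. The cache
// itself would be defined in ksqlDB with something like:
//   CREATE TABLE pings_per_territory AS
//     SELECT territory, COUNT(*) AS ping_count FROM driver_pings GROUP BY territory;
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object TerritoryActivityClient {
  def main(args: Array[String]): Unit = {
    // A pull query: fetch the current aggregate for one key straight from the
    // materialized state, with no round trip to the original database.
    val body =
      """{"ksql": "SELECT ping_count FROM pings_per_territory WHERE territory = 'downtown';", "streamsProperties": {}}"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8088/query"))
      .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```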

The tutorial starts you from “grab the Docker container” and takes you through the process.

Unit Testing in Apache Flink

Kartik Khare has a guide to help us write unit tests for our Apache Flink code:

Writing tests for process functions that work with time is quite similar to writing tests for stateful functions, because you can also use the test harness. However, you need to take care of another aspect, which is providing timestamps for events and controlling the current time of the application. By setting the current (processing or event) time, you can trigger registered timers, which will call the onTimer method of the function.
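
If you have not used the harness before, such a test is shaped roughly like the following sketch, where MyTimedFunction is a made-up toy KeyedProcessFunction that registers a processing-time timer and emits output from onTimer; this is not the post's code.

```scala
// Sketch only: the function and values are toys; the harness classes come from the
// flink-streaming-java test jar.
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.operators.KeyedProcessOperator
import org.apache.flink.streaming.util.KeyedOneInputStreamOperatorTestHarness
import org.apache.flink.util.Collector

// A toy function: register a processing-time timer one second out, emit from onTimer.
class MyTimedFunction extends KeyedProcessFunction[String, String, String] {
  override def processElement(value: String,
                              ctx: KeyedProcessFunction[String, String, String]#Context,
                              out: Collector[String]): Unit =
    ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 1000)

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, String, String]#OnTimerContext,
                       out: Collector[String]): Unit =
    out.collect("timer fired")
}

// Inside a unit test body:
val testHarness = new KeyedOneInputStreamOperatorTestHarness[String, String, String](
  new KeyedProcessOperator(new MyTimedFunction),
  new KeySelector[String, String] { override def getKey(value: String): String = value },
  BasicTypeInfo.STRING_TYPE_INFO)

testHarness.open()
testHarness.processElement("event-1", 100L)  // the element plus its timestamp

// Advancing processing time fires the registered timer, which calls onTimer.
testHarness.setProcessingTime(5000L)

val emitted = testHarness.extractOutputValues()  // should now contain "timer fired"
```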

Click through for demos and more details on the test harness.

Apache Flink 1.9.2 Released

Hequn Cheng announces Apache Flink 1.9.2:

The Apache Flink community released the second bugfix version of the Apache Flink 1.9 series.

This release includes 117 fixes and minor improvements for Flink 1.9.1. The list below includes a detailed list of all fixes and improvements.

We highly recommend all users to upgrade to Flink 1.9.2.

Read on to see all of the things they’ve fixed and improved.

Log Aggregation with Apache Flink

Gyula Fora and Matyas Orhidi have started a series on log aggregation with Apache Flink:

There are several off-the-shelf solutions available on the market for log aggregation, which come with their own stack of components and operational difficulties. For example, notable logging frameworks that are widely used in the industry are ELK stack and Graylog. 

Unfortunately, there is no clear-cut solution that works for every application, and different logging solutions might be more suitable for certain use cases. The log processing of real-time applications should, for instance, also happen in real time; otherwise, we lose timely information that may be required to successfully operate the system.

In this blog post, we dive deep into logging for real-time applications.

This post is mostly understanding and setup, but it leads into processing and visualization.

Fraud Detection with Flink

Alexander Fedulov gives us a case study of using Apache Flink for fraud detection:

In this blog post, we have discussed the motivation behind supporting dynamic, runtime changes to a Flink application by looking at a sample use case – a Fraud Detection engine. We have described the overall architecture and interactions between its components as well as provided references for building and running a demo Fraud Detection application in a dockerized setup. We then showed the details of implementing a dynamic data partitioning pattern as the first underlying building block to enable flexible runtime configurations.

To remain focused on describing the core mechanics of the pattern, we kept the complexity of the DSL and the underlying rules engine to a minimum. Going forward, it is easy to imagine adding extensions such as allowing more sophisticated rule definitions, including filtering of certain events, logical rules chaining, and other more advanced functionality.
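
The dynamic partitioning piece is easier to picture with a toy example. The following is not the post's code, just a simplified sketch in which each transaction is fanned out once per active rule and re-keyed by the fields that rule groups on; in the real engine the rules arrive on a broadcast stream rather than as a static list.

```scala
// Simplified sketch: rules are a static list here; the post broadcasts them at runtime.
import org.apache.flink.streaming.api.scala._

case class Transaction(payerId: String, beneficiaryId: String, amount: Double)
case class Rule(id: Int, groupingKeyNames: Seq[String])
case class Keyed(key: String, ruleId: Int, transaction: Transaction)

def partitionByRules(transactions: DataStream[Transaction],
                     rules: Seq[Rule]): KeyedStream[Keyed, String] =
  transactions
    .flatMap { tx =>
      rules.map { rule =>
        // Build the partition key from the fields this rule groups on.
        val key = rule.groupingKeyNames.map {
          case "payerId"       => tx.payerId
          case "beneficiaryId" => tx.beneficiaryId
        }.mkString("#")
        Keyed(key, rule.id, tx)
      }
    }
    .keyBy(_.key)   // downstream per-rule aggregations now run on the dynamic key
```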

It was an interesting discussion and you can grab the code as well.

Streams and Tables in Apache Kafka

Michael Noll wraps up a series on Apache Kafka. First up is the fundamentals of Kafka Streams:

A table is a, well, table in the ordinary technical sense of the word, and we have already talked a bit about tables before (a table in Kafka is today more like an RDBMS materialized view than an RDBMS table, because it relies on a change being made elsewhere rather than being directly updatable itself). Seen through the lens of event streaming, however, a table is also an aggregated stream. This is a reference to the stream-table duality we discussed in part 1.

In the conclusion, Michael covers a few advanced topics:

Streams and tables are always fault tolerant because their data is stored reliably and durably in Kafka. This should be relatively easy to understand for streams by now as they map to Kafka topics in a straightforward manner. If something breaks while processing a stream, then we just need to re-read the underlying topic again.

For tables, it is more complex because they must maintain additional information—their state—to allow for stateful processing such as joins and aggregations like COUNT() or SUM(). To achieve this while also ensuring high processing performance, tables (through their state stores) are materialized on local disk within a Kafka Streams application instance or a ksqlDB server. But machines and containers can be lost, along with any locally stored data. How can we make tables fault tolerant, too?
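
The "table is an aggregated stream" idea maps directly onto the Streams DSL, as in this hedged sketch (Kafka Streams Scala DSL, hypothetical topic names): counting a stream of page views yields a KTable, and the state store behind that count is what Kafka Streams replicates to a changelog topic to answer the fault-tolerance question above.

```scala
// Sketch only: topic names and application id are hypothetical. Note that the
// implicit Serdes import path moved in newer Kafka versions
// (org.apache.kafka.streams.scala.serialization.Serdes).
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object PageViewCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // A stream of (userId, url) page-view events...
  val views = builder.stream[String, String]("page-views")

  // ...aggregated into a table: the current count per url. The backing state store
  // is replicated to a changelog topic, which is how the table stays fault tolerant.
  val countsByUrl = views
    .groupBy((user, url) => url)
    .count()

  countsByUrl.toStream.to("page-view-counts-by-url")

  new KafkaStreams(builder.build(), props).start()
}
```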

This was a nice series.
