Category: Streaming

Serialization in Apache Flink

Published 2020-04-15 by Kevin Feasel

Nico Kruber walks us through the viable set of serializers in Apache Flink:

Flink handles data types and serialization with its own type descriptors, generic type extraction, and type serialization framework. We recommend reading through the documentation first in order to be able to follow the arguments we present below. In essence, Flink tries to infer information about your job’s data types for wire and state serialization, and to be able to use grouping, joining, and aggregation operations by referring to individual field names, e.g. stream.keyBy(“ruleId”) or dataSet.join(another).where("name").equalTo("personName"). It also allows optimizations in the serialization format as well as reducing unnecessary de/serializations (mainly in certain Batch operations as well as in the SQL/Table APIs).

Click through for notes on each serializer and a graph which shows how the choice of a serializer can make a huge difference.

Comments closed

Stateful Functions in Apache Flink

Published 2020-04-09 by Kevin Feasel

Stephan Ewen announces Stateful Functions 2.0:

Today, we are announcing the release of Stateful Functions (StateFun) 2.0 — the first release of Stateful Functions as part of the Apache Flink project. This release marks a big milestone: Stateful Functions 2.0 is not only an API update, but the first version of an event-driven database that is built on Apache Flink.
Stateful Functions 2.0 makes it possible to combine StateFun’s powerful approach to state and composition with the elasticity, rapid scaling/scale-to-zero and rolling upgrade capabilities of FaaS implementations like AWS Lambda and modern resource orchestration frameworks like Kubernetes.
With these features, Stateful Functions 2.0 addresses two of the most cited shortcomings of many FaaS setups today: consistent state and efficient messaging between functions.

Read on to see how it works.

Comments closed

Using Apache Flink to Read from Apache Kafka

Published 2020-04-07 by Kevin Feasel

Preetdeep Kumar crosses the streams:

Apache Flink provides various connectors to integrate with other systems. In this article, I will share an example of consuming records from Kafka through FlinkKafkaConsumer and producing records to Kafka using FlinkKafkaProducer.

Read on for an example. I’m glad to see that integration between these two competitors (more exactly, Flink and Kafka Streams are competitors) is so easy.

Comments closed

The Flink-Hive Integration

Published 2020-04-01 by Kevin Feasel

Bowen Li takes us through Apache Flink 1.10’s integration with Apache Hive:

On the other hand, Apache Hive has established itself as a focal point of the data warehousing ecosystem. It serves as not only a SQL engine for big data analytics and ETL, but also a data management platform, where data is discovered and defined. As business evolves, it puts new requirements on data warehouse.
Thus we started integrating Flink and Hive as a beta version in Flink 1.9. Over the past few months, we have been listening to users’ requests and feedback, extensively enhancing our product, and running rigorous benchmarks (which will be published soon separately). I’m glad to announce that the integration between Flink and Hive is at production grade in Flink 1.10 and we can’t wait to walk you through the details.

Click through to see how it works.

Comments closed

Storing Streaming Data in Azure Data Lake

Published 2020-03-25 by Kevin Feasel

Jesse Gorter takes us through writing streaming data from Event Hubs into Azure Data Lake Storage:

In my previous blog I showed how you can stream Twitter data to an Event Hub and stream the data to a Power BI live dashboard. In this post, I am going to show you how to store this data for long term storage. An Event Hub stores your events temporarily. That means it does not store them for later analysis. Say you want to analyze whether negative or positive tweets have an impact on your sales, you would need to store tweets for a historical view.
The question is where to store this data: directly to the datawarehouse, or store it to a data lake? This really depends on the architecture that you want to have. A data lake is often used to store the raw data historically. Is is especially interesting because it allows to store any kind of data, structured or unstructured and it is quite cheap compared to Azure SQL database or Azure SQL datawarehouse. So for that reason, we are going to store it in a data lake.

Jesse walks us through data lake creation and data migration from Event Hubs into a Data Lake Storage container.

Comments closed

Monitoring Data Quality on Streaming Data

Published 2020-03-10 by Kevin Feasel

Abraham Pabbathi and Greg Wood want to check data quality on Spark Streaming data:

While the emergence of streaming in the mainstream is a net positive, there is some baggage that comes along with this architecture. In particular, there has historically been a tradeoff: high-quality data, or high-velocity data? In reality, this is not a valid question; quality must be coupled to velocity for all practical means — to achieve high velocity, we need high quality data. After all, low quality at high velocity will require reprocessing, often in batch; low velocity at high quality, on the other hand, fails to meet the needs of many modern problems. As more companies adopt streaming as a lynchpin for their processing architectures, both velocity and quality must improve.
In this blog post, we’ll dive into one data management architecture that can be used to combat corrupt or bad data in streams by proactively monitoring and analyzing data as it arrives without causing bottlenecks.

This was one of the sticking points of the lambda architecture: new data could still be incomplete and possibly wrong, but until reached the batch layer, you wouldn’t know that.

Comments closed

How Apache Beam Runs on Top of Apache Flink

Published 2020-02-27 by Kevin Feasel

Maximilian Michels and Markos Sfikas explain why you might want to combine Apache Beam with Apache Flink:

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

Read on for the full story. If you’re so inclined, you can also check out the full talk as a video.

Comments closed

Loading Data into Delta Lake

Published 2020-02-27 by Kevin Feasel

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.

Comments closed

Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

Published 2020-02-25 by Kevin Feasel

Steffen Hasumann shows us how to put together a streaming ETL pipeline in AWS using Apache Flink and Amazon Kinesis Data Analytics:

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.
This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.

Click through for the walkthrough.

Comments closed

Creating Sources and Sinks with Blink

Published 2020-02-21 by Kevin Feasel

Seth Wiesman has a tutorial showing how to create sources and sinks using Apache Flink’s SQL interface, Blink:

A lot of work focused on improving runtime performance and progressively extending its coverage of the SQL standard. Flink now supports the full TPC-DS query set for batch queries, reflecting the readiness of its SQL engine to address the needs of modern data warehouse-like workloads. Its streaming SQL supports an almost equal set of features – those that are well defined on a streaming runtime – including complex joins and MATCH_RECOGNIZE.
As important as this work is, the community also strives to make these features generally accessible to the broadest audience possible. That is why the Flink community is excited in 1.10 to offer production-ready DDL syntax (e.g., CREATE TABLE, DROP TABLE) and a refactored catalog interface.

Click through for a demonstration. One of the nicest things about the ANSI SQL standard is that it was intended to be a one-language solution, where the language used for administration is the same as the language used for regular queries. That cuts down on the number of languages you need to know to get your job done.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31