Flink: Batch as a Special Case of Streaming

Fabian Hueske and Aljoscha Krettek describe streaming versus batch processing in Apache Flink:

The Apache Flink project has followed the philosophy of taking a unified approach to batch and stream data processing, building on the core paradigm of “continuous processing of unbounded data streams” for a long time. If you think about it, carrying out offline processing of bounded data sets naturally fits the paradigm: these are just streams of recorded data that happen to end at some point in time.

Flink is not alone in this: there are other projects in the open source community that embrace “streaming first, with batch as a special case of streaming,” such as Apache Beam; and this philosophy has often been cited as a powerful way to greatly reduce the complexity of data infrastructures by building data applications that generalize across real-time and offline processing.

Check it out. At the end, the authors also describe Blink, a fork of Flink being (slowly) merged back in and which supports this paradigm.

Related Posts

Spark Streaming DStreams

Manish Mishra explains the fundamental abstraction of Spark Streaming: Before going into details of the operations available on the DStream API, let us look at the input sources from which we can start a Stream. There are multiple ways in which we can get the inputs from e.g. Kafka, Flume, etc. Or simple Idle files. […]

Read More

Multi-Region Replication with Confluent Platform

David Arthur walks us through multi-region replication of Kafka clusters in the Confluent Platform 5.4 preview: Running a single Apache Kafka® cluster across multiple datacenters (DCs) is a common, yet somewhat taboo architecture. This architecture, referred to as a stretch cluster, provides several operational benefits and unlocks the door to many uses cases. Stretch clusters provide […]

Read More

Categories

April 2019
MTWTFSS
« Mar May »
1234567
891011121314
15161718192021
22232425262728
2930