Flink And Kafka Streams

Neha Narkhede and Stephan Ewen compare Apache Flink versus Kafka Streams:

Before Flink, users of stream processing frameworks had to make hard choices and trade off either latency, throughput, or result accuracy. Flink was the first open source framework (and still the only one), that has been demonstrated to deliver (1) throughput in the order oftens of millions of events per second in moderate clusters, (2) sub-second latency that can be as low as few 10s of milliseconds, (3) guaranteed exactly once semantics for application state, as well as exactly once end-to-end delivery with supported sources and sinks (e.g., pipelines from Kafka to Flink to HDFS or Cassandra), and (4) accurate results in the presence of out of order data arrival through its support for event time. Flink is based on a cluster architecture with master and worker nodes. Flink clusters are highly available, and can be deployed standalone or with resource managers such as YARN and Mesos. This architecture is what allows Flink to use a lightweight checkpointing mechanism to guarantee exactly-once results in the case of failures, as well allow easy and correct re-processing via savepoints without sacrificing latency or throughput. Finally, Flink is also a full-fledged batch processing framework, and, in addition to its DataStream and DataSet APIs (for stream and batch processing respectively), offers a variety of higher-level APIs and libraries, such as CEP (for Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (for Machine Learning), and Gelly (for graph processing). Flink has been proven to run very robustly in production at very large scale by several companies, powering applications that are used every day by end customers.

The upshot is that the two products don’t do exactly the same thing, and there might be room in your organization for the two of them.

Related Posts

Hortonworks Data Platform 3.0 Released

Saumitra Buragohain, et al, announce the newest version of the Hortonworks Data Platform: Highlighted Apache Hive features include: Workload management for LLAP:  You can assign resource pools within LLAP pool and allocate resources on a per user or per group basis. This enables support for large multi-tenant deployments. ACID v2 and ACID on by default:  We are […]

Read More

Replicating Data In HDFS Between Clusters

Murali Ramasami and Niru Anisetti have an article showing how to use the Hortonworks Data Lifecycle Manager to set up replication between two Hadoop clusters: Data Lifecycle Manager (DLM) delivers on the promise of location-agnostic, secure replication by encapsulating and copying data seamlessly across physical private storage and public cloud environments. This empowers businesses to […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930