Press "Enter" to skip to content

Category: Streaming

Real-Time Analytics with Divolte, Kafka, Druid, and Superset

Fokko Driesprong gives us a proof of concept architecture for real-time analytics in the Hadoop ecosystem:

Divolte Collector is a scalable and performant application for collecting clickstream data and publishing it to a sink, such as Kafka, HDFS or S3. Divolte has been developed by GoDataDriven and made available to the public under the Apache 2.0 open source license.

Divolte can be used as the foundation to build anything from basic web analytics dashboarding to real-time recommender engines or banner optimization systems. By using a JavaScript tag in the browser of the customers, it gathers data about their behaviour on the website or application. You’re in full control what you do, and don’t want to capture.

Click through for the example.

Comments closed

Apache Flink 1.9 Released

The Apache Flink crew announces version 1.9.0:

The Apache Flink project’s goal is to develop a stream processing system to unify and power many forms of real-time and offline data processing applications as well as event-driven applications. In this release, we have made a huge step forward in that effort, by integrating Flink’s stream and batch processing capabilities under a single, unified runtime.

Significant features on this path are batch-style recovery for batch jobs and a preview of the new Blink-based query engine for Table API and SQL queries. We are also excited to announce the availability of the State Processor API, which is one of the most frequently requested features and enables users to read and write savepoints with Flink DataSet jobs. Finally, Flink 1.9 includes a reworked WebUI and previews of Flink’s new Python Table API and its integration with the Apache Hive ecosystem.

Click through for the major changes.

Comments closed

The Basics of Apache Airflow

Divyansh Jain explains what Apache Airflow is and takes us through a sample solution:

Airflow is a platform to programmatically author, schedule & monitor workflows or data pipelines. These functions achieved with Directed Acyclic Graphs (DAG) of the tasks. It is an open-source and still in the incubator stage. It was initialized in 2014 under the umbrella of Airbnb since then it got an excellent reputation with approximately 800 contributors on GitHub and 13000 stars. The main functions of Apache Airflow is to schedule workflow, monitor and author.

It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.

Comments closed

Cloudera Stream Processing

Dinesh Chandrasekhar announces the new iteration of Cloudera’s streaming data processor:

Cloudera Stream Processing (CSP) is a product within the Cloudera DataFlow platform that packs Kafka along with some key streaming components that empower enterprises to handle some of the most complex and sophisticated streaming use cases. CSP provides advanced messaging, real-time processing and analytics on real-time streaming data using Apache Kafka. CSP also supports key management and monitoring capabilities powered by Cloudera Streams Management (CSM).

Sounds like they’re taking the Kafka route over Spark Streaming, Flink, Airflow, etc.

Comments closed

Kafka 2.3 and Kafka Connect Improvements

Robin Moffatt goes over improvements in Kafka Connect with the release of Apache Kafka 2.3:

A Kafka Connect cluster is made up of one or more worker processes, and the cluster distributes the work of connectors as tasks. When a connector or worker is added or removed, Kafka Connect will attempt to rebalance these tasks. Before version 2.3 of Kafka, the cluster stopped all tasks, recomputed where to run all tasks, and then started everything again. Each rebalance halted all ingest and egress work for usually short periods of time, but also sometimes for a not insignificant duration of time.

Now with KIP-415, Apache Kafka 2.3 instead uses incremental cooperative rebalancing, which rebalances only those tasks that need to be started, stopped, or moved. For more details, there are available resources that you can readlisten, and watch, or you can hear the lead engineer on the work, Konstantine Karantasis, talk about it in person at the upcoming Kafka Summit.

Looks like some nice improvements here.

Comments closed

An Apache Flume Overview

Daniel Berman takes us through an overview of Apache Flume:

Apache Flume was developed by Cloudera to provide a way to quickly and reliably stream large volumes of log files generated by web servers into Hadoop. There, applications can perform further analysis on data in a distributed environment. Initially, Apache Flume was developed to handle only log data. Later, it was equipped to handle event data as well.

Click through to get a code-free, high-level understanding of Flume and where it can work for you.

Comments closed

KSQL UDFs

Mitch Seymour takes us through user-defined functions in Kafka’s flavor of SQL:

One of KSQL’s most powerful features is allowing users to build their own KSQL functions for processing real-time streams of data. These functions can be invoked on individual messages (user-defined functions or UDFs) or used to perform aggregations on groups of messages (user-defined aggregate functions or UDAFs).

The previous blog post How to Build a UDF and/or UDAF in KSQL 5.0 discussed some key steps for building and deploying a custom KSQL UDF/UDAF. Now with Confluent Platform 5.3.0, creating custom KSQL functions is even easier when you leverage Maven, a tool for building and managing dependencies in Java projects.

Read on to see just how easy it is.

Comments closed

Stream Processing with Kafka

Satish Sharma has a four-part series covering stream processing with Apache Kafka. Part 1 gives us an overview of Kafka:

Apache Kafka is an open-source distributed stream processing platform originally developed by LinkedIn and later donated to Apache in 2011.

We can describe Kafka as a collection of files, filled with messages that are distributed across multiple machines. Most of Kafka analogies revolve around tying these various individual logs together, routing messages from producers to consumers reliably, replicating for fault tolerance, and handling failure gracefully. Its architecture inherits more from storage systems like HDFS, HBase, or Cassandra than it does from traditional messaging systems that implement JMS or AMQP. The underlying abstraction is a partitioned log, essentially a set of append-only files spread over several machines. This encourages sequential access patterns. A Kafka cluster is a distributed system that spreads data over many machines both for fault tolerance and for linear scale-out.

Part 2 covers terminology and concepts:

Kafka Streams API
Kafka Streams API is a Java library that allows you to build real-time applications. These applications can be packaged, deployed, and monitored like any other Java application — there is no need to install separate processing clusters or similar special-purpose and expensive infrastructures!

The Streams API is scalable, lightweight, and fault-tolerant; it is stateless and allows for stateful processing. 

Part 3 has you install and configure Kafka:

For quick testing, let’s start a handy console consumer, which reads messages from a specified topic and displays them back on the console. We will use the same to consumer to read all of our messages from this point forward. Use the following command: 

Linux -> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Windows -> bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Part 4 is forthcoming.

Comments closed

Monitoring Backpressure in Apache Flink

Nico Kruber and Piotr Nowosjki explain how you can monitor the flow of your Apache Flink processes:

Probably the most important part of network monitoring is monitoring backpressure, a situation where a system is receiving data at a higher rate than it can process. Such behaviour will result in the sender being backpressured and may be caused by two things:

– The receiver is slow.
This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.

– The network channel is slow.
Even though in such case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous capacity planning blog post provides some more insights.

Read the whole thing. Backpressure is not a topic unique to Flink, but affects any ETL or streaming operation.

Comments closed

Flink’s Network Stack

Nico Kruber dives into the internals of Apache Flink’s network stack:

Flink’s network stack is one of the core components that make up the flink-runtime module and sit at the heart of every Flink job. It connects individual work units (subtasks) from all TaskManagers. This is where your streamed-in data flows through and it is therefore crucial to the performance of your Flink job for both the throughput as well as latency you observe. In contrast to the coordination channels between TaskManagers and JobManagers which are using RPCs via Akka, the network stack between TaskManagers relies on a much lower-level API using Netty.

This blog post is the first in a series of posts about the network stack. In the sections below, we will first have a high-level look at what abstractions are exposed to the stream operators and then go into detail on the physical implementation and various optimisations Flink did. We will briefly present the result of these optimisations and Flink’s trade-off between throughput and latency. Future blog posts in this series will elaborate more on monitoring and metrics, tuning parameters, and common anti-patterns.

There’s a lot in here and it’s worth reading.

Comments closed