Category: Streaming

Flink And Kafka Streams

Published 2016-09-01 by Kevin Feasel

Neha Narkhede and Stephan Ewen compare Apache Flink versus Kafka Streams:

Before Flink, users of stream processing frameworks had to make hard choices and trade off either latency, throughput, or result accuracy. Flink was the first open source framework (and still the only one), that has been demonstrated to deliver (1) throughput in the order oftens of millions of events per second in moderate clusters, (2) sub-second latency that can be as low as f ew 10s of milliseconds , (3) guaranteed exactly once semantics for application state, as well as exactly once end-to-end delivery with supported sources and sinks (e.g., pipelines from Kafka to Flink to HDFS or Cassandra), and (4) accurate results in the presence of out of order data arrival through its support for event time. Flink is based on a cluster architecture with master and worker nodes. Flink clusters are highly available, and can be deployed standalone or with resource managers such as YARN and Mesos. This architecture is what allows Flink to use a lightweight checkpointing mechanism to guarantee exactly-once results in the case of failures, as well allow easy and correct re-processing via savepoints without sacrificing latency or throughput. Finally, Flink is also a full-fledged batch processing framework, and, in addition to its DataStream and DataSet APIs (for stream and batch processing respectively), offers a variety of higher-level APIs and libraries, such as CEP (for Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (for Machine Learning), and Gelly (for graph processing). Flink has been proven to run very robustly in production at very large scale by several companies, powering applications that are used every day by end customers.

The upshot is that the two products don’t do exactly the same thing, and there might be room in your organization for the two of them.

Comments closed

Data Access And Streaming

Published 2016-08-24 by Kevin Feasel

Kartik Paramasivam discusses data access problems and solutions within a streaming architecture:

Using a remote store: This is the traditional model for building applications. Here, when an application needs to process an event, it makes a remote call to a separate SQL or No-SQL database. In this model, write operations are always remote calls, but reads can be performed on a local cache in certain scenarios. There are a large number of applications at LinkedIn that fall into this category.

Another pattern is to use a remote cache (e.g., Couchbase) that is fronting a remote database (e.g., Oracle). If the remote cache is used primarily for reading adjunct data, then applications use an Oracle change capture stream (using Databus) to populate the remote cache.

This is a must-read if you’re looking at implementing a streaming architecture and need to do any kind of data enrichment.

Comments closed

Reprocessing Kafka Stream Data

Published 2016-08-22 by Kevin Feasel

Matthias J Sax shows how to reprocess input data using Kafka Streams:

In this blog post we describe how to tell a Kafka Streams application to reprocess its input data from scratch. This is actually a very common situation when you are implementing stream processing applications in practice, and it might be required for a number of reasons, including but not limited to: during development and testing, when addressing bugs in production, when doing A/B testing of algorithms and campaigns, when giving demos to customers or internal stakeholders, and so on.

The quick answer is you can do this either manually (cumbersome and error-prone) or you can use the new application reset tool for Kafka Streams, which is an easy-to-use solution for the problem. The application reset tool is available starting with upcoming Confluent Platform 3.0.1 and Apache Kafka 0.10.0.1.

In the first part of this post we explain how to use the new application reset tool. In the second part we discuss what is required for a proper (manual) reset of a Kafka Streams application. This parts includes a deep dive into relevant Kafka Streams internals, namely internal topics, operator state, and offset commits. As you will see, these details make manually resetting an application a bit complex, hence the motivation to create an easy to use application reset tool.

Being able to reprocess streams is a critical part of the Kappa architecture, and this article is a nice overview of how to do that if you’re using Kafka Streams.

Comments closed

Flink: Streams Versus Batches

Published 2016-08-18 by Kevin Feasel

Kevin Jacobs has an article comparing Apache Flink to Spark Streaming:

The other type of data are data streams. Data streams can be visualized by water flowing from a tap to a sink. This process is not ending. The nice property of streams is that you can consume the stream while it is flowing. There is almost no latency involved for consuming a stream.

Apache Spark is fundamentally based on batches of data. By that, for all processing jobs at least some latency is introduced. Apache Flink on the other hand is fundamentally based on streams. Let’s take a look at some evidence for the difference in latency.

Read the whole thing.

Comments closed

Streaming Data To Power BI

Published 2016-08-18 by Kevin Feasel

Reza Rad shows how to hook up streaming data sources to Power BI:

As I mentioned before, download the sample project. and then select the project under this path:

samples/consoleapp/getting-started-for-dotnet-real-time/PBIRealTimeStreaming

And open the PBIRealTimeStreaming solution in Visual Studio. This project creates a data set including a datetime value and a numeric value and will pass that through Power BI API (which will be discussed later in another post) to Power BI service.

Real-time dashboards are great for making it look like you’re doing Very Important Things. They can also be useful in other ways too. Read the whole thing.

Comments closed

Kinesis Analytics

Published 2016-08-15 by Kevin Feasel

Ryan Nienhuis shows how to write SQL against Amazon Kinesis queues:

In your application code, you interact primarily with in-application streams. For instance, a source in-application stream represents your configured Amazon Kinesis stream or Firehose delivery stream in the application, which by default is named “SOURCE_SQL_STREAM_001”. A destination in-application stream represents your configured destinations, which by default is named “DESTINATION_SQL_STREAM”. When interacting with in-application streams, the following is true:

The SELECT statement is used in the context of an INSERT statement. That is, when you select rows from one in-application stream, you insert results into another in-application stream.

The INSERT statement is always used in the context of a pump. That is, you use pumps to write to an in-application stream. A pump is the mechanism used to make an INSERT statement continuous.

There are two separate SQL statements in the template you selected in the first walkthrough. The first statement creates a target in-application stream for the SQL results; the second statement creates a PUMP for inserting into that stream and includes the SELECT statement.

This is worth looking into if you use AWS and have a need for streaming data.

Comments closed

Structured Streaming

Published 2016-08-03 by Kevin Feasel

Matei Zaharia, et al discuss how to use structured streaming in Apache Spark 2.0:

In Structured Streaming, we tackle the issue of semantics head-on by making a strong guarantee about the system: at any time, the output of the application is equivalent to executing a batch job on a prefix of the data. For example, in our monitoring application, the result table in MySQL will always be equivalent to taking a prefix of each phone’s update stream (whatever data made it to the system so far) and running the SQL query we showed above. There will never be “open” events counted faster than “close” events, duplicate updates on failure, etc. Structured Streaming automatically handles consistency and reliability both within the engine and in interactions with external systems (e.g. updating MySQL transactionally).

If you want to learn more about streaming data using Spark, check this out.

Comments closed

Detecting Web Traffic Anomalies

Published 2016-06-27 by Kevin Feasel

Jan Kunigk combines a few Apache products to perform near-real-time analysis of web traffic data:

meinestadt.de web servers generate up to 20 million user sessions per day, which can easily result in up to several thousand HTTP GET requests per second during peak times (and expected to scale to much higher volumes in the future). Although there is a permanent fraction of bad requests, at times the number of bad requests jumps.

The meinestadt.de approach is to use a Spark Streaming application to feed an Impala table every n minutes with the current counts of HTTP status codes within the n minutes window. Analysts and engineers query the table via standard BI tools to detect bad requests.

What follows is a fairly detailed architectural walkthrough as well as configuration and implementation work. It’s a fairly long read, but if you’re interested in delving into Hadoop, it’s a good place to start.

Comments closed

Kafka And MapR Streams

Published 2016-05-18 by Kevin Feasel

Ellen Friedman compares and contrasts Apache Kafka with MapR streams:

What’s the difference in MapR Streams and Kafka Streams?

This one’s easy: Different technologies for different purposes. There’s a difference between messagingtechnologies (Apache Kafka, MapR Streams) versus tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between its processing capabilities and those of more full-service stream processing technologies such as Spark Streaming or Flink.

Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance IoT sensors, machine logs, clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.

This also includes an interesting discussion of how the same term, “broker,” can be used in two different products in the same general product space and mean two distinct things.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30