Category: Streaming

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Comments closed

Filtering On Kafka Streams

Published 2018-04-06 by Kevin Feasel

Robin Moffatt has a new series showing how to use Kafka Streams for dealing with syslog data:

syslog is one of those ubiquitous standards on which much of modern computing runs. Built into operating systems such as Linux, it’s also commonplace in networking and IoT devices like IP cameras. It provides a way for streaming log messages, along with metadata such as the source host, severity of the message, and so on. Sometimes the target is simply a local logfile, but more often it’s a centralised syslog server which in turn may log or process the messages further.

As a high-performance, distributed streaming platform, Apache Kafka® is a great tool for centralised ingestion of syslog data. Since Apache Kafka also persists data and supports native stream processing we don’t need to land it elsewhere before we can utilise the data. You can stream syslog data into Kafka in a variety of ways, including through Kafka Connect for which there is a dedicated syslog plugin.

In this post, we’re going to see how KSQL can be used to process syslog messages as they arrive in realtime.

Check it out.

Comments closed

Continuous Processing Mode With Spark Structured Streaming

Published 2018-03-26 by Kevin Feasel

Joseph Torres, et al, explain how continuous processing mode works with Apache Spark 2.3’s structured streaming:

Suppose we want to build a real-time pipeline to flag fraudulent credit card transactions. Ideally, we want to identify and deny a fraudulent transaction as soon as the culprit has swiped his/her credit card. However, we don’t want to delay legitimate transactions as that would annoy customers. This leads to a strict upper bound on the end-to-end processing latency of our pipeline. Given that there are other delays in transit, the pipeline must process each transaction within 10-20 ms.

Let’s try to build this pipeline in Structured Streaming. Assume that we have a user-defined function “isPaymentFlagged” that can identify the fraudulent transactions. To minimize the latency, we’ll use a 0 second processing time trigger indicating that Spark should start each micro batch as fast as it can with no delays.

They also explain how this newer model differs from the prior model of collecting events in microbatches.

Comments closed

Joining Multiple Types Of Data With KSQL

Published 2018-03-23 by Kevin Feasel

Robin Moffatt has an example where he enriches streaming CSV data with information stored in MySQL:

This is a continuous query that executes in the background until explicitly terminated by the user. In effect, these are stream processing applications, and all we need to create them is SQL! Here all we’ve done is an enrichment (joining two sets of data), but we could easily add predicates to the data (simply include a WHERE clause), or even aggregations.

You can see which queries are running with the SHOW QUERIES; statement. All queries will pause if the KSQL server stops, and restart automagically when the KSQL server starts again.

The DESCRIBE EXTENDED command can be used to see information about the derived stream such as the one created above. As well as simply the columns involved, we can see information about the underlying topic, and run-time stats such as the number of messages processed and the timestamp of the most recent one.

It’s pretty easy to do; click through to see just how easy.

Comments closed

Securing KSQL

Published 2018-02-23 by Kevin Feasel

Yeva Byzek shows the methods available to secure a Kafka Streams application:

To connect to a secured Kafka cluster, Kafka client applications need to provide their security credentials. In the same way, we configure KSQL such that the KSQL servers are authenticated and authorized, and data communication is encrypted when communicating with the Kafka cluster. We can configure KSQL for:

Read the whole thing if you’re thinking about using Kafka Streams.

Comments closed

Streaming ETL In Practice Using KSQL

Published 2018-02-20 by Kevin Feasel

Robin Moffatt builds an example of streaming ETL using Oracle, GoldenGate, and Kafka:

So in this post I’m going to show an example of what streaming ETL looks like in practice. I’m replacing batch extracts with event streams, and batch transformation with in-flight transformation of these event streams. We’ll take a stream of data from a transactional system built on Oracle, transform it, and stream it into Elasticsearch to land the results to, but your choice of datastore is up to you—with Kafka’s Connect API you can stream the data to almost anywhere! Using KSQL we’ll see how to filter streams of events in real-time from a database, how to join between events from two database tables, and how to create rolling aggregates on this data.

It’s a very useful example.

Comments closed

User-Defined Functions In KSQL

Published 2018-02-07 by Kevin Feasel

Kai Waehner demonstrates building a user-defined function for Kafka Streams:

As you can see, the full implementation is just a few lines of Java code. In general, you need to implement the logic between receiving input and returning output of the UDF in the evaluate()method. You also need to implement exception handling (e.g. invalid input arguments) where applicable. The init() method is empty in this case, but could initialise any required object instances.

Note that this UDF has state: dateFormat can be null or already initialized. However, no worries. You do not have to manage the scope as Kafka Streams (and therefore KSQL) threads are independent of each other. So this won’t cause any issues.

Click through for the entire process.

Comments closed

Hortonworks DataFlow 3.1 Released

Published 2018-02-05 by Kevin Feasel

George Vetticaden and Haimo Liu announce Hortonworks DataFlow version 3.1:

Apache Kafka 1.0 support with full integration with HDF Services – Kafka 1.0 provides important new features including more stringent message processing semantics with support for message headers and transactions, performance improvements and advanced security options.

Apache Ambari support for Kafka 1.0 – Install, configure, manage, upgrade, monitor, and secure Kafka 1.0 clusters with Ambari.
Apache Ranger support for Kafka 1.0 – Manage access control policies (ACLs) using resource or tag-based security for Kafka 1.0 clusters.
New NiFi and SAM processors for Kafka 1.0 – New processors in NiFi and Hortonworks Streaming Analytics Manager (SAM) support Kafka 1.0 features including message headers and transactions.

Click through for the list of top changes.

Comments closed

KSQL 0.4 Released

Published 2018-02-02 by Kevin Feasel

Apurva Mehta announces the release of KSQL 0.4:

The SHOW TOPICS command has been enhanced to include the number of active consumers and also the number of active consumer groups which are reading the topics.

Consumer groups are a feature of Apache Kafka which enable multiple consumer processes to divide the work of consuming Kafka topic. You can learn more about them in the Kafka Consumer JavaDocs, and of course you should read the SHOW TOPICS documentation for more information.

Read on for the full set of changes.

Comments closed

Subtle Changes In Application Behavior Across Kafka Streams Versions

Published 2018-01-29 by Kevin Feasel

Aleksandar Pejakovic shows some subtle but important changes to an application running Kafka Streams 0.11 versus 1.0:

One would expect that by changing the version, the previous behavior would remain the same. Well, it hasn’t. What has changed?

After each process method, a punctuate method is called. After punctuateInterval is scheduled, punctuate also occurs. This means the following:

In the first test scenario, each “Arrived: message_<offset>” message in the console is accompanied with “Punctuate call”. Unsurprisingly, we have one: “Processed: 1” message in output topic. After ten messages, we have another: “Punctuate call” and “Processed: 0” pair.

In the second scenario, we have nine: “Arrived: message_<offset>” and “Punctuate call” pairs on the console, followed with 9: “Processed: 1” in the output topic. After the pause and tenth message we have: “Arrived: message_<offset>” and 3 “Punctuate call”. In the output topic, we see “Processed: 1”, “Processed: 0”, and “Processed 0”.

Read the whole thing. This sort of behavioral change can be hard to suss out when testing a streaming application.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31