Category: Streaming

Comparing Streaming Engines

Published 2018-12-11 by Kevin Feasel

George Vetticaden compares Spark Streaming, Storm, and Kafka Streams:

Before the addition of Kafka Streams support, HDP and HDF supported two stream processing engines: Spark Structured Streaming and Streaming Analytics Manager (SAM) with Storm. So naturally, this begets the following question:
Why add a third stream processing engine to the platform?
With the choice of using Spark structured streaming or SAM with Storm support, customers had the choice to pick the right stream processing engine based on their non- functional requirements and use cases. However, neither of these engines addressed the following types of requirements that we saw from our customers:

And this doesn’t even include Samza or Flink, two other popular streaming engines.

My biased answer is, forget Storm. If you have a legacy implementation of it, that’s fine, but I wouldn’t recommend new streaming implementations based off of it. After that, you can compare the two competitors (as well as Samza and Flink) to see which fits your environment better. I don’t think either of these has many scenarios where you completely regret going with, say, Kafka Streams instead of Spark Streaming. Each has its advantages, but they’re not so radically different.

Comments closed

Using Databricks Delta In Lieu Of Lambda Architecture

Published 2018-12-10 by Kevin Feasel

Jose Mendes contrasts the Lambda architecture with the Databricks Delta architecture and gives us a quick example of using Databricks Delta:

The major problem of the Lambda architecture is that we have to build two separate pipelines, which can be very complex, and, ultimately, difficult to combine the processing of batch and real-time data, however, it is now possible to overcome such limitation if we have the possibility to change our approach.
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming in a single system. The core abstraction of Databricks Delta is an optimized Spark table that stores data as parquet files in DBFS and maintains a transaction log that tracks changes to the table.

It’s an interesting contrast and I recommend reading the whole thing.

Comments closed

Configuring Kafka Streams For Least Privilege

Published 2018-12-05 by Kevin Feasel

Gwen Shapira explains how we can assign minimal rights to Kafka Streams and KSQL users:

The principle of least privilege dictates that each user and application will have the minimal privileges required to do their job. When applied to Apache Kafka^® and its Streams API, it usually means that each team and application will have read and write access only to a selected few relevant topics.

Organizations need to balance developer velocity and security, which means that each organization will likely have their own requirements and best practices for access control.

There are two simple patterns you can use to easily configure the right privileges for any Kafka Streams application—one provides tighter security, and the other is for more agile organizations. First, we’ll start with a bit of background on why configuring proper privileges for Kafka Streams applications was challenging in the past.

Read the whole thing; “granting everybody all rights” generally isn’t a good idea, no matter what your data platform of choice may be.

Comments closed

Apache Samza At 1.0

Published 2018-11-29 by Kevin Feasel

Jagadish Venkatraman announces Apache Samza 1.0:

We are pleased to announce today the release of Samza 1.0, a significant milestone in the history of the project. Apache Samza is a distributed stream processing framework that we developed at LinkedIn in 2013. Samza became a top-level Apache project in 2014. Fast-forward to 2018, and we currently have over 3,000 applications in production leveraging Samza at LinkedIn. The use-cases include detecting anomalies, combating fraud, monitoring performance, notifications, real-time analytics, and many more. Today, Samza integrates not only with Apache Kafka, but also with many other systems, including Azure EventHubs, Amazon Kinesis, HDFS, ElasticSearch, and Brooklin. Multiple companies like Slack, TripAdvisor, eBay, and Optimizely have adopted Samza.

We view Samza 1.0 as a step towards our vision of making stream processing universally accessible. In this post, we describe our journey in building and scaling a distributed stream processing system. We also present the key features in Samza 1.0: a rich high-level API, event-time-based processing, integration with Apache Beam, Samza SQL, a standalone mode to run Samza without YARN, and a new test framework for Samza applications.

This runs in the same space as Spark Streaming, Flink, and Kafka Streams, so there are plenty of competitors and a lot of innovation.

Comments closed

Spark Streaming On Azure Databricks

Published 2018-10-25 by Kevin Feasel

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queue/files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing which usually occurs on an hourly or daily basis. Often is this the case, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

Azure Stream Analytics
Spark Streaming / Storm on HDInsight
Spark Streaming on Databricks
Azure Functions

Click through for more.

Comments closed

Monitoring Apache NiFi With A Custom Dashboard

Published 2018-10-23 by Kevin Feasel

Tim Spann has started a new series on monitoring Apache NiFi:

In this little proof of concept work, we grab some of these flows process them in Apache NiFi and then store them in Apache Hive 3 tables for analytics. We should probably push the data to HBase for aggregates and Druid for time series. We will see as this expands.

There are also other data access options including the NiFi REST API and the NiFi Python APIs.

Boostrap Notifier

Send notification when the NiFi starts, stops or died unexpectedly

Two OOTB notifications

Email notification service

HTTP notification service

It’s easy to write a custom notification service

Reporting Tasks

AmbariReportingTask (global, per process group)
MonitorDiskUsage(Flowfile, content, provenance repositories)
MonitorMemory

Much of this is an overview of the tools and measures available.

Comments closed

Troubleshooting KSQL Executions

Published 2018-10-05 by Kevin Feasel

Robin Moffatt shows us some of the tools available for researching problems with KSQL queries executed against a server:

What does any self-respecting application need? Metrics! We need to know how many messages have been processed, when the last message was processed and so on.

The simplest option for gathering these metrics comes from within KSQL itself, using the same DESCRIBE EXTENDED command that we saw before:
ksql> DESCRIBE EXTENDED GOOD_RATINGS;
[...]
Local runtime statistics
------------------------
messages-per-sec:      1.10 total-messages:     2898 last-message: 9/17/18 1:48:47 PM UTC
 failed-messages:         0 failed-messages-per-sec:         0 last-failed: n/a
(Statistics of the local KSQL server interaction with the Kafka topic GOOD_RATINGS)
ksql>

You can get more details, including explain plans, from this. There are external tools which Robin demonstrates as well, which let you track the streams over time.

Comments closed

Scaling Anomaly Detection With Kafka And Cassandra

Published 2018-10-01 by Kevin Feasel

Paul Brebner has started a series on anomaly detection using Kafka and Cassandra, starting with an introduction:

Let’s look at the application domain in more detail. In the previous blog series on Kongo, a Kafka focussed IoT logistics application, we persisted business “violations” to Cassandra for future use using Kafka Connect. For example, we could have used the data in Cassandra to check and certify that a delivery was free of violations across its complete storage and transportation chain.

An appropriate scenario for a Platform application involving Kafka and Cassandra has the following characteristics:

Large volumes of streaming data is ingested into Kafka (at variable rates)
Data is sent to Cassandra for long term persistence
Streams processing is triggered by the incoming events in real-time
Historic data is requested from Cassandra
Historic data is retrieved from Cassandra
Historic data is processed, and
A result is produced.

It looks like he’s focusing on changepoint detection, which is one of several good techniques for generalized anomaly detection. I’ll be interested in following this series.

Comments closed

Troubleshooting KSQL

Published 2018-09-28 by Kevin Feasel

Robin Moffatt walks us through a few scenarios where KSQL queries aren’t returning any data:

Probably the most common question in the Confluent Community Slack group’s #ksql channel is:

Why isn’t my KSQL query returning data?

That is, you’ve run a CREATE STREAM, but when you go to query it…
ksql> SELECT * FROM MY_FIRST_KSQL_STREAM;

…nothing happens. And because KSQL queries are continuous, your KSQL session appears to “hang.” That’s because KSQL is continuing to wait for any new messages to show you. So if your run a KSQL SELECT and get no results back, what could be the reasons for that?

Robin gives us five reasons why this might be.

Comments closed

Hadoop + SQL Server In 2019

Published 2018-09-27 by Kevin Feasel

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.

Comments closed