Spark Streaming On Azure Databricks

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queue/files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing which usually occurs on an hourly or daily basis. Often is this the case, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

  • Azure Stream Analytics

  • Spark Streaming / Storm on HDInsight

  • Spark Streaming on Databricks

  • Azure Functions

Click through for more.

Monitoring Apache NiFi With A Custom Dashboard

Tim Spann has started a new series on monitoring Apache NiFi:

In this little proof of concept work, we grab some of these flows process them in Apache NiFi and then store them in Apache Hive 3 tables for analytics. We should probably push the data to HBase for aggregates and Druid for time series. We will see as this expands.

There are also other data access options including the NiFi REST API and the NiFi Python APIs.

Boostrap Notifier

  • Send notification when the NiFi starts, stops or died unexpectedly
  • Two OOTB notifications
  • Email notification service
  • HTTP notification service
  • It’s easy to write a custom notification service

Reporting Tasks

  • AmbariReportingTask (global, per process group)

  • MonitorDiskUsage(Flowfile, content, provenance repositories)

  • MonitorMemory

Much of this is an overview of the tools and measures available.

Troubleshooting KSQL Executions

Robin Moffatt shows us some of the tools available for researching problems with KSQL queries executed against a server:

What does any self-respecting application need? Metrics! We need to know how many messages have been processed, when the last message was processed and so on.

The simplest option for gathering these metrics comes from within KSQL itself, using the same DESCRIBE EXTENDED command that we saw before:

ksql> DESCRIBE EXTENDED GOOD_RATINGS;
[...]
Local runtime statistics
------------------------
messages-per-sec:      1.10 total-messages:     2898 last-message: 9/17/18 1:48:47 PM UTC failed-messages:         0 failed-messages-per-sec:         0 last-failed: n/a
(Statistics of the local KSQL server interaction with the Kafka topic GOOD_RATINGS)
ksql>

You can get more details, including explain plans, from this.  There are external tools which Robin demonstrates as well, which let you track the streams over time.

Scaling Anomaly Detection With Kafka And Cassandra

Paul Brebner has started a series on anomaly detection using Kafka and Cassandra, starting with an introduction:

Let’s look at the application domain in more detail. In the previous blog series on Kongo, a Kafka focussed IoT logistics application, we persisted business “violations” to Cassandra for future use using Kafka Connect. For example, we could have used the data in Cassandra to check and certify that a delivery was free of violations across its complete storage and transportation chain.

An appropriate scenario for a Platform application involving Kafka and Cassandra has the following characteristics:

  1. Large volumes of streaming data is ingested into Kafka (at variable rates)

  2. Data is sent to Cassandra for long term persistence

  3. Streams processing is triggered by the incoming events in real-time

  4. Historic data is requested from Cassandra

  5. Historic data is retrieved from Cassandra

  6. Historic data is processed, and

  7. A result is produced.

It looks like he’s focusing on changepoint detection, which is one of several good techniques for generalized anomaly detection.  I’ll be interested in following this series.

Troubleshooting KSQL

Robin Moffatt walks us through a few scenarios where KSQL queries aren’t returning any data:

Probably the most common question in the Confluent Community Slack group’s #ksql channel is:

Why isn’t my KSQL query returning data?

That is, you’ve run a CREATE STREAM, but when you go to query it…

ksql> SELECT * FROM MY_FIRST_KSQL_STREAM;

…nothing happens. And because KSQL queries are continuous, your KSQL session appears to “hang.” That’s because KSQL is continuing to wait for any new messages to show you. So if your run a KSQL SELECT and get no results back, what could be the reasons for that?

Robin gives us five reasons why this might be.

Hadoop + SQL Server In 2019

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.

Writing To Elasticsearch With Spark Streaming

Anuj Saxena has an example of writing data from a Spark Streaming pipeline out to Elasticsearch:

There’s been a lot of time we have been working on streaming data. Using Apache Spark for that can be much convenient. Spark provides two APIs for streaming data one is Spark Streaming which is a separate library provided by Spark. Another one is Structured Streaming which is built upon the Spark-SQL library. We will discuss the trade-offs and differences between these two libraries in another blog. But today we’ll focus on saving streaming data to Elasticseach using Spark Structured Streaming. Elasticsearch added support for Spark Structured Streaming 2.2.0 onwards in version 6.0.0 version of “Elasticsearch For Apache Hadoop” dependency. We will be using these versions or higher to build our sbt-scala project.

Click through for an example.

It’s All ETL (Or ELT) In The End

Robin Moffatt notes that ETL (and ELT) doesn’t go away in a streaming world:

In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept.

  • Extract: Data is available from a source system
  • Transform: We want to filter, cleanse or otherwise enrich this source data
  • Load: Make the data available to another application

There are two key concepts here:

  • Data is created by an application, and we want it to be available to other applications
  • We often want to process the data (for example, cleanse and apply business logic to it) before it is used

Thinking about many applications being built nowadays, particularly in the microservices and event-driven space, we recognize that what they do is take data from one or more systems, manipulate it and then pass it on to another application or system. For example, a fraud detection service will take data from merchant transactions, apply a fraud detection model and write the results to a store such as Elasticsearch for review by an expert. Can you spot the similarity to the above outline? Is this a microservice or ETL process?

Things like this are reason #1 why I expect data platform jobs (administrator and developer) to be around decades from now.  The set of tools expand, but the nature of the job remains similar.

Kalman Filters With Spark And Kafka

Konur Unyelioglu goes deep into Kalman filters:

In simple terms, a Kalman filter is a theoretical model to predict the state of a dynamic system under measurement noise. Originally developed in the 1960s, the Kalman filter has found applications in many different fields of technology including vehicle guidance and control, signal processing, transportation, analysis of economic data, and human health state monitoring, to name a few (see the Kalman filter Wikipedia page for a detailed discussion). A particular application area for the Kalman filter is signal estimation as part of time series analysis. Apache Spark provides a great framework to facilitate time series stream processing. As such, it would be useful to discuss how the Kalman filter can be combined with Apache Spark.

In this article, we will implement a Kalman filter for a simple dynamic model using the Apache Spark Structured Streaming engine and an Apache Kafka data source. We will use Apache Spark version 2.3.1 (latest, as of writing this article), Java version 1.8, and Kafka version 2.0.0. The article is organized as follows: the next section gives an overview of the dynamic model and the corresponding Kalman filter; the following section will discuss the application architecture and the corresponding deployment model, and in that section we will also review the Java code comprising different modules of the application; then, we will show graphically how the Kalman filter performs by comparing the predicted variables to measured variables under random measurement noise; we’ll wrap up the article by giving concluding remarks.

This is going on my “reread carefully” list; it’s very interesting and goes deep into the topic.

Last-Click Attribution With Databricks Delta

Caryl Yuhas and Denny Lee give us an example of building a last-click digital marketing attribution model with Databricks Delta:

The first thing we will need to do is to establish the impression and conversion data streams.   The impression data stream provides us a real-time view of the attributes associated with those customers who were served the digital ad (impression) while the conversion stream denotes customers who have performed an action (e.g. click the ad, purchased an item, etc.) based on that ad.

With Structured Streaming in Databricks, you can quickly plug into the stream as Databricks supports direct connectivity to Kafka (Apache KafkaApache Kafka on AWSApache Kafka on HDInsight) and Kinesis as noted in the following code snippet (this is for impressions, repeat this step for conversions)

This is definitely an interesting approach to the problem.  Check it out.

Categories

November 2018
MTWTFSS
« Oct  
 1234
567891011
12131415161718
19202122232425
2627282930