The 2 main types of stream processing done are:
1. Stateless: Where every event is handled completely independent from the preceding events.
2. Stateful: Where a “state” is shared between events and therefore past events can influence the way current events are processed.
Stateless stream processing is easy to scale up because events are processed independently. But Stateful stream processing is difficult to scale up because the “state” needs to be shared across the events.
Himanshu does point out alternatives, but this isn’t a comparison exercise.
This post is about something new I have tried last week. The goal was to create simulated streaming data source, feed it into Power BI as a streaming dataset, create a report out of the streaming dataset, and then embed it to an web application. With proper directions provided by my teammates, I finished the implementation from end to end within 1.5 hours. I was super impressed by how awesome it is and how easy it is to implement so that I want to share those directions to you.
The source data is simulated but the process is the same with real data sets.
In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically.
To illustrate, let’s take HTTP web server logs generated by Apache web server (for example) as our main source of data. Here’s what a typical log line looks like:
188.8.131.52 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
Click through for the demonstration.
Note that the DB does quite a lot for you: it enforces serializability, locks, your logical constraints, etc. If the DB is distributed (Vitesse, Cockroach, Spanner, Yugabyte), it does even more.
If you were to go Kafka-first… well, it isn’t impossible. But all those responsibilities now belong to you as a developer. And if you are thinking there may be multiple webservers handling user requests and passing them to Kafka, you have to solve fairly challenging problems.
Read the whole thing.
Service (application) exposes the JMX metrics at some port which will be captured by Jolokia java agent. Then Jolokia exposes those metrics at some port which is easily accessible through a rest endpoint (we call it Jolokia URL). Then we have JMX2Graphte which polls the metrics from Jolokia URL and push it to Graphite. Then Grafana reads the Graphite metrics and creates a beautiful dashboard for us along with the alerts.
So this is the working of the proposed monitoring solution. Now let’s discuss the components of the monitoring solution.
There’s a bit of code/configuration in here as well, so check it out.
The UDF I want to implement here is DATEADD. If you’re familiar with SQL, you have definitely used it: it takes in a date and adds or subtracts a specific number value to a specific part of datetime, and spits out a new datetime.
To implement a User Defined Function (UDF or UDAF) you would need to code your function in Java and then import the jar file in your KSQL server. You can read about the full process here, I point out a couple of things that I believe you should pay attention to:
I think I prefer Spark’s method for UDFs in Spark SQL: create it in Scala and register the function.
Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over that period of time, but there were still various challenges, primarily because it’s a very low-level API.
As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDD directly!
For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.
This Python script ran on a single machine, and is from the early days of the company. However, this script didn’t scale since it cannot run in a distributed manner. As a result, this Python job ends up flapping—crashing and restarting regularly in production depending on the load it needs to process.
Second, the Python script puts read pressure on MongoDB and Cassandra, because it has to query the databases for each batch of walk-ins and Zenreach Messages. MongoDB and Cassandra are our primary databases for serving customer read queries. So we wanted to remove the additional read pressure added by this job, which currently competes for resources with our customers.
For these reasons, we wanted to move to a streaming solution—specifically, Kafka Streams. We already switched to Kafka Streams for walk-in detection, which my teammate Eugen Feller explained in a previous post.
Click through for a review of the architecture and some tips if you want to do this yourself.
We understand that streaming data isn’t typically considered “Data Science” by itself. However, it’s are often associated and setting up this background now opens up some cool applications in later posts. For this post, we’ll cover how to sink streaming data to Power BI using Stream Analytics.
The previous posts in this series used Power BI Desktop for all of the showcases. This post will be slightly different in that we will leverage the Power BI Service instead. The Power BI Service is a collaborative web interface that has most of the same reporting capabilities as Power BI Desktop, but lacks the ability to model data at the time of writing. However, we have heard whispers that data modeling capabilities may be coming to the service at some point. The Power BI Service is also the standard method for sharing datasets, reports and dashboards across organizations. For more information on the Power BI Service, read this.
Brad has a nice demo, so check it out.
When you create a stream processing application with Kafka’s Streams API, you create a
Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the
KafkaStreamsclass, which connects to a Kafka cluster and begins processing when you call
start(). For testing though, connecting to a running Kafka cluster and making sure to clean up state between tests adds a lot of complexity and time.
Instead, developers can unit test their Kafka Streams applications with utilities provided by
kafka-streams-test-utils. Introduced in KIP-247, this artifact was specifically created to help developers test their code, and it can be added into your continuous integration and continuous delivery (CI/CD) pipeline.
Streaming applications need tested just like any other.