This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.
The following tools will be used:
Git—to manage and clone source code
Docker—to run some services in containers
Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala
Maven 3—to compile the code we write
Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial
Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.
The Hello World of streaming apps is a Twitter client.
This is pretty cool – the
update_tscolumn is managed automagically by MySQL (other RDBMS have similar functionality), and Kafka Connect’s JDBC connector is using this to pick out new and updated rows from the database.
As a side note here, Kafka Connect tracks the offset of the data that its read using the
connect-offsetstopic. Even if you delete and recreate the connector, if the connector has the same name it will retain the same offsets previously stored. So if you want to start from scratch, you’ll want to change the connector name – for example, use an incrementing suffix for each test version you work with. You can actually check the content of the connect-offsets topic easily:
This is part 1 of a mini-series, but does show you how to build connections to stream data from MySQL into Kafka and then into a flat file.
When we need to process streams of real-time data, Storm is a great contender. Examples of streaming data are the number of consumer clicks and navigations on a website, IIS or user logs, IoT data, and social network information. In all these scenarios, we use real-time data processing. Apache Storm can process real-time unbounded streams of data.
The term “unbounded” defines streams of data with no start or end. Here, the processing of data is continuous and in real-time. Twitter is a good example. Twitter data is continuous, has no start or end time, and is provided in real-time by millions of Twitter users around the world.
Storm wouldn’t rank in my top three technologies for doing this, but it certainly does the job.
Here, we are using Kafka streams in our applications. We are done with the implementation but again, the most important thing left is testing. This blog is about how to test the application we have created. For this, I’ll be taking the sample app I created in my previous blog for both high-level DSL and low-level processor API.
Traditionally, we test our Kafka application with an integration test for which we need to create a ZooKeeper and a real Kafka broker. After that, we need a mock producer and mock consumer for our app to produce the inputs and receive the outputs. That makes it such a big hassle just to test our app. Testing it for real scenarios and for the actual integration test, this is needed without a doubt.
Click through for an example.
You can push data to the Power BI streaming dataset API in a few ways… but they generally boil down to these 3 options…
- Directly call the API from code
- You could use something like Azure Function Apps to iteratively pull NEW rows that land in a SQL table, create the API Call, and push the new data directly to the API
- See here info on Azure Functions – https://azure.microsoft.com/en-us/services/functions/
- Directly call the API from an Azure Logic App
- Azure Logic Apps are cool as for simple functions like this you can do pretty much the same as in the code option above but just using drag/drop and WITHOUT writing any code – https://azure.microsoft.com/en-us/services/logic-apps/
- Use Azure Stream Analytics to push data into the API
- This is leveraging the solution in my previous post to push data from SQL CDC and into an Azure Event Hub, then via Azure Stream Analytics to Power BI
- Azure Stream Analytics – https://azure.microsoft.com/en-us/services/stream-analytics/
- Azure Event Hubs – https://azure.microsoft.com/en-us/services/event-hubs/
This blog post extends on my previous post – and thus I will be leveraging Option #3 above.
Definitely worth checking out if you are interested in real-time Power BI dashboards.
The features provided by Kafka Streams:
Highly scalable, elastic, distributed, and fault-tolerant application.
Stateful and stateless processing.
Event-time processing with windowing, joins, and aggregations.
We can use the already-defined most common transformation operation using Kafka Streams DSL or the lower-level processor API, which allow us to define and connect custom processors.
Low barrier to entry, which means it does not take much configuration and setup to run a small scale trial of stream processing; the rest depends on your use case.
No separate cluster requirements for processing (integrated with Kafka).
Employs one-record-at-a-time processing to achieve millisecond processing latency, and supports event-time based windowing operations with the late arrival of records.
Supports Kafka Connect to connect to different applications and databases.
Read on for more details as well as a sample script to get started.
The solution picks up the SQL data changes from the CDC Change Tracking system tables, creates JSON messages from the change rows, and then posts the message to an Azure Event Hub. Once landed in the Event Hub an Azure Stream Analytics (ASA) Job distributes the changes into the multiple outputs.
What I found pretty cool was that I could transmit SQL delta changes from source to target in as little as 5 seconds end to end!
There are a bunch of steps, but the end result is worth it.
Let’s consider an application that does some real-time stateful stream processing with the Kafka Streams API. We’ll run through a specific example of the end-to-end reference architecture and show you how to:
Run a Kafka source connector to read data from another system (a SQLite3 database), then modify the data in-flight using Single Message Transforms (SMTs) before writing it to the Kafka cluster
Process and enrich the data from a Java application using the Kafka Streams API (e.g. count and sum)
Run a Kafka sink connector to write data from the Kafka cluster to another system (AWS S3)
Read the whole thing.
The most notable new feature is Exactly Once Semantics (EOS). Kafka’s EOS capabilities provide more stringent idempotent producer semantics with exactly once, in-order delivery per partition, and stronger transactional guarantees with atomic writes across multiple partitions. Together, these strong semantics make writing applications easier and expand Kafka’s addressable use cases. You can learn more about EOS in the online talk on June 29, 2017.
“Exactly once,” if done right, would be crazy—there’s a reason most brokers are either “at least once” or “best effort.”
The low latency and an easy-to-use event time support also apply to Kafka Streams. It is a rather focused library, and it’s very well-suited for certain types of tasks. That’s also why some of its design can be so optimized for how Kafka works. You don’t need to set up any kind of special Kafka Streams cluster, and there is no cluster manager. And if you need to do a simple Kafka topic-to-topic transformation, count elements by key, enrich a stream with data from another topic, or run an aggregation or only real-time processing — Kafka Streams is for you.
If event time is not relevant and latencies in the seconds range are acceptable, Spark is the first choice. It is stable and almost any type of system can be easily integrated. In addition it comes with every Hadoop distribution. Furthermore, the code used for batch applications can also be used for the streaming applications as the API is the same.
Read on for more analysis.