Structured Streaming

Matei Zaharia et al. discuss how to use Structured Streaming in Apache Spark 2.0:

In Structured Streaming, we tackle the issue of semantics head-on by making a strong guarantee about the system: at any time, the output of the application is equivalent to executing a batch job on a prefix of the data. For example, in our monitoring application, the result table in MySQL will always be equivalent to taking a prefix of each phone’s update stream (whatever data made it to the system so far) and running the SQL query we showed above. There will never be “open” events counted faster than “close” events, duplicate updates on failure, etc. Structured Streaming automatically handles consistency and reliability both within the engine and in interactions with external systems (e.g. updating MySQL transactionally).
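
To make the idea concrete, here is a minimal sketch of what a Structured Streaming aggregation looks like in Spark 2.0's Scala API. The query from the quoted post isn't reproduced here, so the schema, input path, and console sink below are placeholder assumptions; the quoted example writes transactionally to MySQL instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical schema for the phone event stream.
    val schema = new StructType()
      .add("phone_id", StringType)
      .add("action", StringType)
      .add("event_time", TimestampType)

    // Treat a directory of JSON files as an unbounded input table
    // (the path is a placeholder).
    val events = spark.readStream
      .schema(schema)
      .json("/data/phone-events")

    // Count events per phone and action ("open"/"close"); the engine keeps
    // this result up to date incrementally under the prefix-consistency
    // guarantee described in the quoted passage.
    val counts = events
      .groupBy($"phone_id", $"action")
      .count()

    // Emit the continuously updated result table; a console sink stands in
    // for the transactional MySQL sink mentioned in the post.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```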

If you want to learn more about streaming data using Spark, check this out.

