Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a blackbox and hence does not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above mentioned failure happens.

There’s some good information in here, so check it out.

Avro Schemas In Kafka

Stephane Maarek explains the value of using Apache Avro as a schema structure for your Kafka topics:

  • Avro has support for primitive types ( intstringlongbytes, etc…), complex types (enumarraysunions, optionals), logical types (datestimestamp-millisdecimal), and data record (name and namespace). All the types you’ll ever need.

  • Avro has support for embedded documentation. Although documentation is optional, in my workflow I will reject any Avro Schema PR (pull request) that does not document every single field, even if obvious. By embedding documentation in the schema, you reduce data interpretation misunderstandings, you allow other teams to know about your data without searching a wiki, and you allow your devs to document your schema where they define it. It’s a win-win for everyone.

  • Avro schemas are defined using JSON. Because every developer knows or can easily learn JSON, there’s a very low barrier to entry

Read on for more about Avro as well as the possibilities of using other techniques for defining schemas in Kafka.

When Spark Meets Hive

Anna Martin and Rosaria Silipo look at combining HiveQL and SparkQL:

We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and HiveQL to extract women’s age distribution. We could then compare the two distributions and see if they show any difference.

But the main question, as usual, is: Will SparkSQL queries and HiveQL queries blend?

Topic: Age distribution for men and women in the U.S. state of Maine.

Challenge: Blend results from Hive SQL and Spark SQL queries.

Access mode: Apache Spark and Apache Hive nodes for SQL processing.

Using KNIME, the authors are able to blend together data from different sources.

Kafka Streams And Time-Based Batching

Vladimir Vajda provides a warning for people using Kafka Streams:

To completely understand the problem, we will first go into detail how ingestion and processing occur by default in Kafka Streams. For example purposes, the punctuate method is configured to occur every ten seconds, and in the input stream, we have exactly one message per second. The purpose of the job is to parse input messages, collect them, and, in the punctuate method, do a batch insert in the database, then to send metrics.

After running the Kafka Stream application, the Processor will be created, followed by the initmethod. Here is where all the connections are established. Upon successful start, the application will listen to input topic for incoming messages. It will remain idle until the first message arrives. When the first message arrives, the process method is called — this is where transformations occur and where the result is stored for later use. If no messages are in the input topic, the application will go idle again, waiting for the next message. After each successful process, the application checks if punctuate should be called. In our case, we will have ten process calls followed by one punctuate call, with this cycle repeating indefinitely as long as there are messages.

A pretty obvious behavior, isn’t it? Then why is one bolded?

Read on for more, including how to handle this edge case.

Kafka And GDPR

Ben Stopford has some ideas for using Kafka in a GDPR world:

The simplest way to remove messages from Kafka is to simply let them expire. By default, Kafka will keep data for two weeks, and you can tune this to arbitrarily large periods of time as required. There is also an Admin API that lets you delete messages explicitly if they are older than some specified time or offset. But what if we are keeping data in the log for a longer period of time, say for Event Sourcing architectures or as a source of truth? For this, you can make use of compacted topics, which allow messages to be explicitly deleted or replaced by key.

Data isn’t removed from compacted topics in the same way as in a relational database. Instead, Kafka uses a mechanism closer to those used by Cassandra and HBase where records are marked for removal then later deleted when the compaction process runs. Deleting a message from a compacted topic is as simple as writing a new message to the topic with the key you want to delete and a null value.  When compaction runs the message will be deleted forever.

Click through for more information.

Hadoop 3.0 Is Coming

Alex Woodie reports that Hadoop 3.0 will likely drop before Christmas:

After years of work, the Apache Hadoop community is now putting the finishing touches on a release candidate for Hadoop 3.0 and, barring any unforeseen occurrences, will deliver it by the middle of December, according to Vinod Kumar Vavilapalli, a committer on the Apache Hadoop project and director of engineering at Hortonworks.

“We can’t set the dates in stone, but it’s looking like we’ll get something out by mid-December,” Vavilapalli told Datanami in an interview last week.

Read on for some of the bigger changes that come with this.

Impala Now A Top-Level Project

Greg Rahn announces that Apache Impala is now a top-level project:

Five years ago, Cloudera shared with the world our plan to transfer the lessons from decades of relational database research to the Apache Hadoop platform via a new SQL engine — Apache Impala — the first and fastest open source MPP SQL engine for Hadoop.  Impala enabled SQL users to operate on vast amounts of data in open formats, stored on HDFS originally (with Apache Kudu, Amazon S3, and Microsoft ADLS now also native storage options), and do so in an interactive and iterative manner, which was previously not possible.  Its flexibility and leading analytic database performance drove the strong adoption of Impala across a wide range of global enterprises looking to power these BI and SQL analytic workloads, and led to a constantly growing ecosystem of third-party tools integrating with Impala.

Fast forward three years, Cloudera donated Impala to the Apache Software Foundation, along with the newly announced Apache Kudu project, further solidifying its place in the open source SQL world.  Since the proposal, the Impala engineering team has worked hard to bring Impala to the new software governance model of the Apache Incubator and build an active and innovative community. That’s why we are pleased to announce that Impala has graduated to a Top-Level Apache Software Foundation Project.

Congratulations go out to Cloudera and everyone who has worked on Imapala over the years.

Functional Programming And Microservices

Bobby Calderwood might win me over on microservices with talk like this:

This view of microservices shares much in common with object-oriented programming: encapsulated data access and mutable state change are both achieved via synchronous calls, the web of such calls among services forming a graph of dependencies. Programmers can and should enjoy a lively debate about OO’s merits and drawbacks for organizing code within a single memory and process space. However, when the object-oriented analogy is extended to distributed systems, many problems arise: latency which grows with the depth of the dependency graph, temporal liveness coupling, cascading failures, complex and inconsistent read-time orchestration, data storage proliferation and fragmentation, and extreme difficulty in reasoning about the state of the system at any point in time.

Luckily, another programming style analogy better fits the distributed case: functional programming. Functional programming describes behavior not in terms of in-place mutation of objects, but in terms of the immutable input and output values of pure functions. Such functions may be organized to create a dataflow graph such that when the computation pipeline receives a new input value, all downstream intermediate and final values are reactively computed. The introduction of such input values into this reactive dataflow pipeline forms a logical clock that we can use to reason consistently about the state of the system as of a particular input event, especially if the sequence of input, intermediate, and output values is stored on a durable, immutable log.

It’s an interesting analogy.

Running PySpark In Visual Studio Code

Jenny Jiang shows how to run PySpark on HDInsight in VSCode:

We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. For PySpark developers who value productivity of Python language, VSCode HDInsight Tools offer you a quick Python editor with simple getting started experiences, and enable you to submit PySpark statements to HDInsight clusters with interactive responses. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights.

Click through to see how it’s done.

HDFS Federation

Kevin Feasel



Sangeeta Gulia explains what HDFS Federation is and how it differs from classic HDFS:

HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS cluster to new implementations and use cases.

Namenodes are federated, that is, all these NameNodes work independently and don’t require any coordination with each other.

It’s one way to reduce the number of potential single points of failure in a Hadoop environment.


December 2017
« Nov