Category: Hadoop

Late-Arriving Data with Spark Streaming

Sarfaraz Hussain continues a series on Spark Streaming:

The size of the State (discussed in the previous blog) will continue to increase indefinitely as we really don’t know when a bucket can be closed.

But practically, a query is not going to receive data one week late, and even if it did, such late-arriving data would be of no use to us.

So, to tell the streaming query when to stop considering older buckets, we use a Watermark.

Read on to see how you can design and implement a watermark.
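
As a point of reference, here is a minimal sketch (not code from the post) of what a watermarked aggregation looks like in Structured Streaming; the source, column names, window size, and 10-minute threshold are all illustrative:

```scala
// Minimal, self-contained sketch: a 10-minute watermark on an event-time column
// lets Spark drop state for buckets older than the watermark threshold.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()
import spark.implicits._

// Stand-in streaming source; in practice this would be Kafka, files, etc.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("word", lit("example"))

val counts = events
  .withWatermark("eventTime", "10 minutes")              // stop tracking buckets more than 10 minutes late
  .groupBy(window($"eventTime", "5 minutes"), $"word")
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
```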

Comments closed

Processing Larger Messages with Apache Kafka

Kai Wähner walks us through the tradeoffs of sending large messages in Apache Kafka:

After exploring use cases for large message payloads, let’s clarify what Kafka is not:

Kafka is usually not the right technology to store and process large files (images, videos, proprietary files, etc.) as a whole. Products were built specifically for these use cases.

For instance, a Content Delivery Network (CDN) such as Akamai, Limelight Networks, or Amazon CloudFront distributes video streams and other software downloads across the globe. Or think of “big file editing and processing”: video editing tools from Adobe, Autodesk, Camtasia, and many other vendors are used to structure and present all video information, including films and television shows, video advertisements, and video essays.

There’s a lot of good advice in here. I think the best advice is essentially “don’t do this unless you need it” but I appreciate that Kai goes a lot further than that.
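
If you do decide to push somewhat larger payloads through Kafka rather than a purpose-built store, the size limits have to be raised on the producer, the broker/topic, and the consumer together. Here is a hedged sketch of the producer side, with purely illustrative values and a made-up topic name:

```scala
// Illustrative producer settings for larger payloads; broker/topic limits
// (message.max.bytes / max.message.bytes) and consumer limits
// (max.partition.fetch.bytes) must be raised to match.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.ByteArraySerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, (10 * 1024 * 1024).toString) // raise the default ~1 MB cap
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd")                       // shrink payloads on the wire

val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

// Hypothetical 5 MB payload, purely for illustration.
val largeBytes: Array[Byte] = Array.fill[Byte](5 * 1024 * 1024)(0.toByte)
producer.send(new ProducerRecord[Array[Byte], Array[Byte]]("large-payloads", largeBytes))

producer.flush()
producer.close()
```

A common alternative is to keep the payload in external storage and send only a reference through Kafka, which avoids raising these limits at all.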

Comments closed

Stateful Streaming with Spark

Sarfaraz Hussain continues a series on Spark Streaming:

Structured Streaming does processing under the hood as micro-batches (default nature).

State is versioned between micro-batches while the streaming query runs. So, as the series of incremental execution plans is generated (discussed in Part 2), each execution plan knows which version of the state it needs to read from.

Each micro-batch reads the previous version of the state data (i.e., the previous running count), then updates it and creates a new version. Each of these versions gets checkpointed into the same checkpoint location that we have provided in the query.

Read on to understand the implications of this and what it allows you to do.
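
To make the state-versioning concrete, here is a minimal sketch (not the series’ code) of a stateful running count whose state gets checkpointed on every micro-batch; the socket source and paths are illustrative:

```scala
// Minimal sketch: a running word count. Each micro-batch reads the previous
// version of the per-word counts, updates them, and checkpoints a new version.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder.appName("StatefulCountSketch").getOrCreate()
import spark.implicits._

val words = spark.readStream
  .format("socket")               // stand-in source for illustration
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .flatMap(_.split(" "))

val runningCounts = words.groupBy("value").count()   // the state: one running count per word

runningCounts.writeStream
  .outputMode(OutputMode.Complete())
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/word-count")  // state versions land here
  .start()
```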

Comments closed

Multi-Threaded Message Consumption with Kafka

Igor Buzatovic takes us through a fairly advanced topic in Apache Kafka:

If you are familiar with basic Kafka concepts, you know that you can parallelize message consumption by simply adding more consumers in the same group. However, that approach is more suitable for horizontal scaling where you add new consumers by adding new application nodes (containers, VMs, and even bare metal instances).

A multi-consumer approach can also be used for vertical scaling, but this requires additional management of consumer instances and accompanying consuming threads in the application code. Using multiple consumer instances introduces additional network traffic as well as more work for the consumer group coordinator since it has to manage more consumers.

While these concerns may not be strong enough reasons for switching from a thread per consumer to a multi-threaded model, there are use cases in which a multi-threaded model has compelling advantages.

Read the whole thing.
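
For orientation, this is the shape of the thread-pool idea in a deliberately simplified form; it is not Igor’s code, and it skips exactly the hard parts his article deals with (offset tracking, pausing partitions, and rebalance handling). The topic name and pool size are made up:

```scala
// Simplified sketch: one consumer polls, a fixed thread pool processes records.
// Offsets are auto-committed here, which is unsafe for real workloads; see the
// article for proper offset and rebalance management.
import java.time.Duration
import java.util.{Collections, Properties}
import java.util.concurrent.Executors
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "multi-threaded-demo")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))

val pool = Executors.newFixedThreadPool(8)

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach { record =>
    pool.submit(new Runnable {
      override def run(): Unit =
        println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
    })
  }
}
```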

Comments closed

An Introduction to Spark Streaming

Sarfaraz Hussain has started a series on Spark Streaming. The first post gives an introduction to the topic:

The philosophy behind the development of Structured Streaming is that

“We as end users should not have to reason about streaming.”

What that means is that we as end users should only write batch-like queries, and it’s Spark’s job to figure out how to run them on a continuous stream of data and continuously update the result as new data flows in.

Sarfaraz then follows this up with a bit on the structure of a streaming query:

So, as new data comes in, Spark breaks it into micro-batches (based on the Processing Trigger), processes it, and writes it out to the Parquet file.

It is Spark’s job to figure out whether the query we have written is executed on batch data or streaming data. Since, in this case, we are reading data from a Kafka topic, Spark will automatically figure out how to run the query incrementally on the streaming data.

Check them both out.
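
The Kafka-to-Parquet query being described looks roughly like this; a minimal sketch with made-up topic, paths, and trigger interval:

```scala
// Minimal sketch: read from a Kafka topic and continuously write Parquet files.
// Spark runs the same DataFrame logic incrementally, micro-batch by micro-batch.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("KafkaToParquetSketch").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

stream.writeStream
  .format("parquet")
  .option("path", "/tmp/output/events")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .trigger(Trigger.ProcessingTime("1 minute"))   // the processing trigger mentioned above
  .start()
```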

Comments closed

Learning from a Hadoop Outage

Sandhya Ramu and Vasanth Rajamani have an after-action report:

For companies and organizations, failure tends to be far more illuminating than success, and the lingering effects of a failure can be harmful if the team moves too quickly and does not resolve the issue in a thorough and transparent manner. We recently ran into a large incident that involved data loss in our big data ecosystem, and by reflecting on our diagnosis and response, we hope that our learnings from this impactful incident will be insightful.

Here’s what happened: roughly 2% of the machines across a handful of racks were inadvertently reimaged. This was caused by procedural gaps in our Hadoop infrastructure’s host life cycle management. Compounding our woes, the incident happened on our business-critical production cluster.

Read on to understand what happened and why. It’s a lesson in the importance of having a disaster recovery plan and testing it.

Comments closed

What’s New in Kafka 2.6?

Randall Hauch takes us through the changes in Apache Kafka 2.6:

We’ve made quite a few significant performance improvements in this release, particularly when the broker has larger partition counts. Broker shutdown performance is significantly improved, and performance is dramatically improved when producers use compression. Various aspects of ACL usage are faster and require less memory. And we’ve reduced memory allocations in several other places within the broker.

This release also adds support for Java 14. And over the past few releases, the community has switched to using Scala 2.13 by default and now recommends using Scala 2.13 for production.

Finally, these accomplishments are only one part of a larger active roadmap in the run up to Apache Kafka 3.0, which may be one of the most significant releases in the project’s history. 

Read on to learn more.

Comments closed

HIVE-6384 Errors with Spark and Parquet

Manoj Pandey troubleshoots an issue:

But I was getting the following error:

warning: there was one feature warning; re-run with -feature for details
java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384

As per the above error, it relates to some Hive version conflict, so I tried checking the Hive version by running the below command and found that it was pointing to an old version (0.13.0). This version of the Hive metastore did not support BINARY datatypes for Parquet-formatted files.

Read on to see how Manoj was able to fix the problem in Azure Databricks.
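
The excerpt elides the exact command, but one way to check which Hive metastore version a session is pointing at (in Azure Databricks or plain Spark) is through the spark.sql.hive.metastore.version setting; a hypothetical sketch, assuming an active SparkSession named spark, and not necessarily the command Manoj used:

```scala
// Hypothetical check, not necessarily the command from the post: inspect the
// Hive metastore version the Spark session is configured to use.
val metastoreVersion = spark.conf.get("spark.sql.hive.metastore.version")
println(s"Hive metastore version: $metastoreVersion")

// The same value is visible through SQL:
spark.sql("SET spark.sql.hive.metastore.version").show(truncate = false)
```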

Comments closed

Kafka Integration with Knime

Swantika Gupta shows off some of Knime’s ability to integrate with Apache Kafka:

Knime Analytics Platform provides its users a way to consume messages from Apache Kafka and publish the transformed results back to Kafka. This allows users to integrate their Knime workflows easily with a distributed streaming pub-sub mechanism.

With Knime 3.6+, users get a Kafka extension with three new nodes:
1. Kafka Connector
2. Kafka Consumer
3. Kafka Producer

Click through to see how to configure each and how to enrich your data with Knime.

Comments closed