Press "Enter" to skip to content

Category: Hadoop

Improving Join Performance on ADF Data Flows

Mark Kromer has a few tips on improving ADF data flow join performance:

When you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you ensure that you (1) have column values from both sides of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows. (2) Avoid use of literal conditions to represent the results of one side of your join condition.

In other words, avoid this for your join condition:source1@movieId == '1'Instead, implement that with a dummy derived column. 

There are several good tips in this post.

Comments closed

Log Aggregation with Apache Flink

Gyula Fora and Matyas Orhidi have started a series on log aggregation with Apache Flink:

There are several off-the-shelf solutions available on the market for log aggregation, which come with their own stack of components and operational difficulties. For example, notable logging frameworks that are widely used in the industry are ELK stack and Graylog. 

Unfortunately, there is no clear cut solution that works for every application, and different logging solutions might be more suitable for certain use cases. The log processing of real-time applications should for instance also happen in real-time, otherwise, we lose timely information that may be required to successfully operate the system.

In this blog post, we dive deep into logging for real-time applications.

This post is mostly understanding and setup, but it leads into processing and visualization.

Comments closed

Fraud Detection with Flink

Alexander Fedulov gives us a case study of using Apache Flink for fraud detection:

In this blog post, we have discussed the motivation behind supporting dynamic, runtime changes to a Flink application by looking at a sample use case – a Fraud Detection engine. We have described the overall architecture and interactions between its components as well as provided references for building and running a demo Fraud Detection application in a dockerized setup. We then showed the details of implementing a dynamic data partitioning pattern as the first underlying building block to enable flexible runtime configurations.

To remain focused on describing the core mechanics of the pattern, we kept the complexity of the DSL and the underlying rules engine to a minimum. Going forward, it is easy to imagine adding extensions such as allowing more sophisticated rule definitions, including filtering of certain events, logical rules chaining, and other more advanced functionality.

It was an interesting discussion and you can grab the code as well.

Comments closed

Streams and Tables in Apache Kafka

Michael Noll wraps up a series on Apache Kafka. First up is the fundamentals of Kafka Streams:

table is a, well, table in the ordinary technical sense of the word, and we have already talked a bit about tables before (a table in Kafka is today more like an RDBMS materialized view than an RDBMS table, because it relies on a change being made elsewhere rather than being directly updatable itself). Seen through the lens of event streaming however, a table is also an aggregated stream. This is a reference to the stream-table duality we discussed in part 1.

In the conclusion, Michael covers a few advanced topics:

Streams and tables are always fault tolerant because their data is stored reliably and durably in Kafka. This should be relatively easy to understand for streams by now as they map to Kafka topics in a straightforward manner. If something breaks while processing a stream, then we just need to re-read the underlying topic again.

For tables, it is more complex because they must maintain additional information—their state—to allow for stateful processing such as joins and aggregations like COUNT() or SUM(). To achieve this while also ensuring high processing performance, tables (through their state stores) are materialized on local disk within a Kafka Streams application instance or a ksqlDB server. But machines and containers can be lost, along with any locally stored data. How can we make tables fault tolerant, too?

This was a nice series.

Comments closed

Using Koalas on Azure Databricks

Ginger Grant shows how you can install the koalas library on an Azure Databricks cluster:

Unfortunately if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas github documentation  says “In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning”.  What this means is if you want to use it now

Most of the time I want to install on the whole cluster as I segment libraries by cluster.  This way if I want those libraries I just connect to the cluster that has them. Now the easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) then go to the Libraries tab at the top of the screen.

Click through for a demo of what you need to do.

Comments closed

Flink in Cloudera Streaming Analytics

Dinesh Chandrasekhar announces support for Apache Flink in Cloudera Streaming Analytics:

We cannot hold our excitement anymore! For the last few months, our Data-in-Motion engineering teams have been working hard to deliver a compelling and critical part of our Cloudera DataFlow (CDF) story. To enhance our Stream Processing and Analytics narrative within the overall Data-in-Motion platform, we give you support for Apache Flink with the general availability of Cloudera Streaming Analytics (CSA).

Cloudera Streaming Analytics, powered by Apache Flink, is a new product offering within the Cloudera DataFlow (CDF) platform that provides real-time stateful processing of IoT-scale data streams and complex events for predictive insights. Cloudera DataFlow (as seen in the picture below) is a comprehensive edge-to-cloud real-time streaming data platform. As one of the key pillars of CDF, stream processing & analytics is important for processing millions of data points and complex events coming from various streaming sources. Over the years, we have supported several streaming engines but the addition of Flink now makes CDF an extremely compelling platform for processing high-volumes of streaming data at high-scale. 

This is adding support for Flink; it looks like Spark Streaming and Kafka Streams are also supported, though they are pushing Flink as a first option rather than one among equals.

Comments closed

Databricks Automated Deployment and Testing

Li Yu, et al, explain how to use Databricks notebooks and MLflow to automate deployment and testing of Spark solutions:

Today many data science (DS) organizations are accelerating the agile analytics development process using Databricks notebooks.  Fully leveraging the distributed computing power of Apache Spark™, these organizations are able to interact easily with data at multi-terabytes scale, from exploration to fast prototype and all the way to productionize sophisticated machine learning (ML) models.  As fast iteration is achieved at high velocity, what has become increasingly evident is that it is non-trivial to manage the DS life cycle for efficiency, reproducibility, and high-quality. The challenge multiplies in large enterprises where data volume grows exponentially, the expectation of ROI is high on getting business value from data, and cross-functional collaborations are common.

In this blog, we introduce a joint work with Iterable that hardens the DS process with best practices from software development.  This approach automates building, testing, and deployment of DS workflow from inside Databricks notebooks and integrates fully with MLflow and Databricks CLI. It enables proper version control and comprehensive logging of important metrics, including functional and integration tests, model performance metrics, and data lineage. All of these are achieved without the need to maintain a separate build server.

Read on to see how.

Comments closed

Streams and Tables in Apache Kafka

Michael Noll has started a four-part series. Part one serves as a primer:

In my daily work as a member of Confluent’s Office of the CTO and as the former product manager for ksqlDB and Kafka Streams, I interact with many users of Apache Kafka—be it developers, operators, or architects. Some have a stream processing or Kafka background, some have their roots in relational databases like Oracle and MySQL, and some have neither. But many of them have the same set of technical questions, such as: What’s the difference between an event stream and a database table? Is a Kafka topic the same as a stream? How can I best leverage all these pieces when I want to put my data in Kafka to use?

By the end of this series, you will have answers to each of these common questions and many more. If you are interested to learn about Kafka, I invite you to join me on this journey through Kafka’s core fundamentals!

Part 2 looks at storage fundamentals:

Part 1 of this series discussed the basic elements of an event streaming platform: events, streams, and tables. We also introduced the stream-table duality and learned why it is a crucial concept for an event streaming platform like Apache Kafka®. Here in part 2, we will take a deep dive into Kafka’s storage fundamentals. Notably, we will explore topics and—in my opinion, the most important concept in Kafka: partitions.

We’ll start with the most basic storage question: how do I store data in Kafka?

I’m looking forward to parts 3 and 4.

Comments closed

Schema Evolution in Kafka

The Hadoop in Real World group takes us through schema changes in Apache Kafka:

Meetup.com went live with this new way of distributing RSVPs – that is through Kafka. Both the producer and consumer agrees on the Schema and everything is great. It is silly to think that the schema would stay like that forever. Let’s say meetup.com didn’t feel the value in providing member_id field and removes it. What do you think will happen – will it affect consumers? 

member_id field doesn’t have a default value and it is considered a required column so this change will affect the consumers. When a producer removes a required field, the consumer will see an error something like below –

Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 63
Caused by: org.apache.avro.AvroTypeException: found com.hirw.kafkaschemaregistry.producer.Rsvp,
expecting com.hirw.kafkaschemaregistry.producer.Rsvp, missing required field member_id

This is an interesting review of the schema registry in Kafka and what the different modes allow for.

Comments closed

Static Analysis of Hadoop Libraries

Maxim Stefanov ran a static analysis of several Hadoop libraries and here are the findings:

After the analysis was completed, I chose the most interesting warnings and noticed that I had the same number of warnings in production code and in tests. Normally, I don’t consider analyzer warnings from tests. But when I divided them, I couldn’t leave ‘tests’ warnings unattended. “Why not take a look at them,” I thought, “because bugs in tests might also have adverse consequences.” They can lead to incorrect or partial testing, or even to mishmash. 

After I selected the most intriguing warnings, I divided them by the following groups: production, test and the four main Hadoop modules. And now I’m glad to offer the review of analyzer warnings.  

Read on for the list. Hopefully Maxim submitted a few pull requests or at least Jira tickets for the projects.

Comments closed