Category: Hadoop

Spark is Not ACID Compliant

Kundan Kumarr explains how it is that Apache Spark is not ACID compliant:

Atomicity states that it should either write the full data or nothing to the data source when using the Spark DataFrame writer. Consistency, on the other hand, ensures that the data is always in a valid state.

As is evident from the Spark documentation below, while saving a DataFrame to a data source, existing data will be deleted before the new data is written. In case of a job failure, the original data can be lost or corrupted and no new data will be written.
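
As a point of reference, here is a minimal PySpark sketch of the write path in question (the path and schema are placeholders, not from the post): in overwrite mode, Spark removes the existing data at the target before writing the new files, so a failure in between can leave neither dataset behind.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Overwrite mode deletes the existing data at the target path before
# writing the new files. If the job fails after the delete but before the
# write completes, the original data is gone -- there is no rollback.
df.write.mode("overwrite").parquet("/tmp/demo/output")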

Click through for an explanation of these two along with a demo, and then an explanation of how Spark Datasets don’t follow the Isolation or Durability properties either. I don’t think any of this is earth-shattering to people, but it is a good reminder that Spark doesn’t fit all use cases.

Updates in Confluent Platform 5.4

Tim Berglund takes us through what has changed in Confluent Platform 5.4:

Role-Based Access Control (RBAC)

Back in July, we announced the preview for RBAC as part of the Confluent Platform 5.3 release. After gathering feedback and learning from everyone who tried it out, we are now pleased to announce the availability of RBAC in Confluent Platform 5.4. You can now make use of this feature in production environments with Confluent’s full support.

RBAC offers a centralized security implementation for enabling access to resources across the entire Confluent Platform with just the right level of granularity. You can control the permissions you grant to users and groups for specific platform resources, starting at the cluster level and moving all the way down to individual topics, consumer groups, or even individual connectors. You do this by assigning users or groups to roles. This gets you out of the game of managing the individual permissions of a huge number of principals—a real problem for large enterprise deployments.

RBAC delivers comprehensive authorization enforced via all user interfaces (Confluent Control Center UI, CLI, and APIs), and across all Confluent Platform components (Control Center, Schema Registry, REST Proxy, MQTT Proxy, Kafka Connect, and KSQL). Given the distributed architecture not only of Apache Kafka but also of other platform components like Connect and KSQL, having a single framework to centrally manage and enforce security authorizations across all the components is, in a word, essential for managing security at scale.

Click through for several more features and where you can try it out, either on-premises or in a major cloud host.

Improving Join Performance on ADF Data Flows

Mark Kromer has a few tips on improving ADF data flow join performance:

When you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first and then filter out the joined values. But if you (1) ensure that you have column values from both sides of your join condition and (2) avoid using literal conditions to represent the results of one side of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows.

In other words, avoid a join condition that simply compares to a literal like '1'. Instead, implement that with a dummy derived column.
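
Since ADF data flows execute on Spark, a rough PySpark analogue of the tip might look like this (the table and column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-demo").getOrCreate()

orders = spark.createDataFrame([(1, "open"), (2, "closed")], ["order_id", "status"])
flags = spark.createDataFrame([("1", "priority")], ["flag", "label"])

# Anti-pattern: a literal in the join condition references only one side,
# which Spark can plan as a cartesian product followed by a filter.
slow = orders.join(flags, F.lit("1") == flags["flag"])

# The pattern from the post: derive a dummy column on the side that lacked
# a real column, then join column-to-column.
orders_keyed = orders.withColumn("join_key", F.lit("1"))
fast = orders_keyed.join(flags, orders_keyed["join_key"] == flags["flag"])

fast.show()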

There are several good tips in this post.

Log Aggregation with Apache Flink

Gyula Fora and Matyas Orhidi have started a series on log aggregation with Apache Flink:

There are several off-the-shelf solutions available on the market for log aggregation, which come with their own stack of components and operational difficulties. For example, notable logging frameworks that are widely used in the industry are the ELK stack and Graylog.

Unfortunately, there is no clear cut solution that works for every application, and different logging solutions might be more suitable for certain use cases. The log processing of real-time applications should for instance also happen in real-time, otherwise, we lose timely information that may be required to successfully operate the system.

In this blog post, we dive deep into logging for real-time applications.

This post is mostly understanding and setup, but it leads into processing and visualization.

Fraud Detection with Flink

Alexander Fedulov gives us a case study of using Apache Flink for fraud detection:

In this blog post, we have discussed the motivation behind supporting dynamic, runtime changes to a Flink application by looking at a sample use case – a Fraud Detection engine. We have described the overall architecture and interactions between its components as well as provided references for building and running a demo Fraud Detection application in a dockerized setup. We then showed the details of implementing a dynamic data partitioning pattern as the first underlying building block to enable flexible runtime configurations.

To remain focused on describing the core mechanics of the pattern, we kept the complexity of the DSL and the underlying rules engine to a minimum. Going forward, it is easy to imagine adding extensions such as allowing more sophisticated rule definitions, including filtering of certain events, logical rules chaining, and other more advanced functionality.

It was an interesting discussion and you can grab the code as well.

Using Koalas on Azure Databricks

Ginger Grant shows how you can install the koalas library on an Azure Databricks cluster:

Unfortunately, if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas GitHub documentation says "In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning". What this means is that if you want to use it now…

Most of the time I want to install on the whole cluster, as I segment libraries by cluster. This way, if I want those libraries, I just connect to the cluster that has them. Now the easiest way to install a library is to open up a running Databricks cluster (start it if it is not running), then go to the Libraries tab at the top of the screen.
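
Once the library is attached, a quick sanity check that Koalas is working might be something like this (the DataFrame contents are only an illustration):

import databricks.koalas as ks

# Koalas exposes a pandas-like API that runs on Spark under the covers.
kdf = ks.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

print(kdf.head())
print(kdf["value"].value_counts())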

Click through for a demo of what you need to do.

Streams and Tables in Apache Kafka

Michael Noll wraps up a series on Apache Kafka. First up is the fundamentals of Kafka Streams:

A table is a, well, table in the ordinary technical sense of the word, and we have already talked a bit about tables before (a table in Kafka is today more like an RDBMS materialized view than an RDBMS table, because it relies on a change being made elsewhere rather than being directly updatable itself). Seen through the lens of event streaming, however, a table is also an aggregated stream. This is a reference to the stream-table duality we discussed in part 1.

In the conclusion, Michael covers a few advanced topics:

Streams and tables are always fault tolerant because their data is stored reliably and durably in Kafka. This should be relatively easy to understand for streams by now as they map to Kafka topics in a straightforward manner. If something breaks while processing a stream, then we just need to re-read the underlying topic again.

For tables, it is more complex because they must maintain additional information—their state—to allow for stateful processing such as joins and aggregations like COUNT() or SUM(). To achieve this while also ensuring high processing performance, tables (through their state stores) are materialized on local disk within a Kafka Streams application instance or a ksqlDB server. But machines and containers can be lost, along with any locally stored data. How can we make tables fault tolerant, too?
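
To make the stream-table duality a bit more concrete, here is a tiny plain-Python sketch (not Kafka Streams itself, just the idea): a table is the latest-value-per-key aggregation of a stream of change events, which is also why replaying the backing topic can rebuild a lost state store.

# A changelog stream of (key, value) events -- think of a Kafka topic.
changelog = [
    ("alice", 1),
    ("bob", 1),
    ("alice", 2),  # a later event for the same key supersedes the older one
]

# A table is an aggregation of that stream: the latest value per key.
def materialize(events):
    table = {}
    for key, value in events:
        table[key] = value
    return table

print(materialize(changelog))  # {'alice': 2, 'bob': 1}

# Fault tolerance follows from the same idea: if the locally materialized
# table is lost, re-reading the changelog from the start rebuilds it.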

This was a nice series.

Flink in Cloudera Streaming Analytics

Dinesh Chandrasekhar announces support for Apache Flink in Cloudera Streaming Analytics:

We cannot hold our excitement anymore! For the last few months, our Data-in-Motion engineering teams have been working hard to deliver a compelling and critical part of our Cloudera DataFlow (CDF) story. To enhance our Stream Processing and Analytics narrative within the overall Data-in-Motion platform, we give you support for Apache Flink with the general availability of Cloudera Streaming Analytics (CSA).

Cloudera Streaming Analytics, powered by Apache Flink, is a new product offering within the Cloudera DataFlow (CDF) platform that provides real-time stateful processing of IoT-scale data streams and complex events for predictive insights. Cloudera DataFlow (as seen in the picture below) is a comprehensive edge-to-cloud real-time streaming data platform. As one of the key pillars of CDF, stream processing & analytics is important for processing millions of data points and complex events coming from various streaming sources. Over the years, we have supported several streaming engines, but the addition of Flink now makes CDF an extremely compelling platform for processing high volumes of streaming data at scale.

This is adding support for Flink; it looks like Spark Streaming and Kafka Streams are also supported, though they are pushing Flink as a first option rather than one among equals.

Databricks Automated Deployment and Testing

Li Yu, et al, explain how to use Databricks notebooks and MLflow to automate deployment and testing of Spark solutions:

Today many data science (DS) organizations are accelerating the agile analytics development process using Databricks notebooks. Fully leveraging the distributed computing power of Apache Spark™, these organizations are able to interact easily with data at multi-terabyte scale, from exploration to fast prototyping and all the way to productionizing sophisticated machine learning (ML) models. As fast iteration is achieved at high velocity, what has become increasingly evident is that it is non-trivial to manage the DS life cycle for efficiency, reproducibility, and high quality. The challenge multiplies in large enterprises where data volume grows exponentially, the expectation of ROI is high on getting business value from data, and cross-functional collaborations are common.

In this blog, we introduce a joint work with Iterable that hardens the DS process with best practices from software development.  This approach automates building, testing, and deployment of DS workflow from inside Databricks notebooks and integrates fully with MLflow and Databricks CLI. It enables proper version control and comprehensive logging of important metrics, including functional and integration tests, model performance metrics, and data lineage. All of these are achieved without the need to maintain a separate build server.
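
To give a flavor of the metric logging involved, here is a minimal MLflow sketch (the run name, parameter, and metrics are placeholders rather than anything from Iterable's actual pipeline):

import mlflow

# Each automated build of the notebook pipeline can be recorded as an
# MLflow run, with test results and model metrics logged alongside it.
with mlflow.start_run(run_name="nightly-build"):
    mlflow.log_param("git_commit", "abc1234")          # placeholder commit hash
    mlflow.log_metric("integration_tests_passed", 42)
    mlflow.log_metric("model_auc", 0.87)               # placeholder model metric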

Read on to see how.

Streams and Tables in Apache Kafka

Michael Noll has started a four-part series. Part one serves as a primer:

In my daily work as a member of Confluent’s Office of the CTO and as the former product manager for ksqlDB and Kafka Streams, I interact with many users of Apache Kafka—be it developers, operators, or architects. Some have a stream processing or Kafka background, some have their roots in relational databases like Oracle and MySQL, and some have neither. But many of them have the same set of technical questions, such as: What’s the difference between an event stream and a database table? Is a Kafka topic the same as a stream? How can I best leverage all these pieces when I want to put my data in Kafka to use?

By the end of this series, you will have answers to each of these common questions and many more. If you are interested to learn about Kafka, I invite you to join me on this journey through Kafka’s core fundamentals!

Part 2 looks at storage fundamentals:

Part 1 of this series discussed the basic elements of an event streaming platform: events, streams, and tables. We also introduced the stream-table duality and learned why it is a crucial concept for an event streaming platform like Apache Kafka®. Here in part 2, we will take a deep dive into Kafka's storage fundamentals. Notably, we will explore topics and, in my opinion, the most important concept in Kafka: partitions.

We’ll start with the most basic storage question: how do I store data in Kafka?
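
As a minimal answer to that question in code, here is a sketch using the confluent-kafka Python client (the broker address, topic, and keys are placeholders, not from the post): events are appended to a topic, and the record key determines which partition they land in.

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Records with the same key hash to the same partition, and ordering is
# guaranteed only within a partition.
producer.produce("payments", key="alice", value='{"amount": 40}')
producer.produce("payments", key="bob", value='{"amount": 15}')
producer.flush()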

I’m looking forward to parts 3 and 4.
