Press "Enter" to skip to content

Category: Hadoop

Securing Amazon Managed Streaming for Kafka

Stephane Maarek has some security advice for us:

AWS launched IAM Access Control for Amazon MSK, which is a security option offered at no additional cost that simplifies cluster authentication and Apache Kafka API authorization using AWS Identity and Access Management (IAM) roles or user policies to control access. This eliminates the need for administrators to run an unfamiliar system to control access to Apache Kafka on Amazon MSK, and learn intricate details and specific commands to manage Apache Kafka access control lists (ACLs).

This is a game-changer from a security perspective for AWS customers who use Apache Kafka: I recommend Amazon MSK customers use IAM Access Control unless they have a specific need for using mutual TLS or SASL/SCRAM authN/Z.
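For a flavor of what IAM-based authorization looks like, here is a minimal, hypothetical policy granting a producer write access to one topic. The region, account ID, cluster name, and wildcarded cluster UUID are all placeholders, and the MSK documentation has the full list of kafka-cluster actions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:DescribeTopic",
        "kafka-cluster:WriteData"
      ],
      "Resource": [
        "arn:aws:kafka:us-east-1:123456789012:cluster/demo-cluster/*",
        "arn:aws:kafka:us-east-1:123456789012:topic/demo-cluster/*/orders"
      ]
    }
  ]
}
```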

Read on to see how it works.


Surviving a Kafka Outage

Jakub Korab walks us through availability features in Kafka as well as what to expect if your brokers are unavailable:

In the case of an outage, you have to ensure that these messages can eventually be processed. Keeping unsent messages around and retrying indefinitely in the hopes that the outage will rectify itself may eventually result in your application running out of memory. This is a crucial consideration in high-throughput applications.

If business functions are performed by systems downstream of Kafka, and the sending application only acts as an ingestion point, the situation is slightly more relaxed. If Kafka is unavailable to send messages to, then no external activity has taken place. For these systems, a Kafka outage might mean that you do not accept new transactions. In such a case, it may be reasonable to return an error message and allow the external third party to retry later. Retail applications typically fall into this category.
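To make that first trade-off concrete, here is a minimal Scala sketch (the broker address and topic name are made up) of a producer configured so that a prolonged outage surfaces as a delivery error rather than unbounded buffering:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092") // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "33554432")     // cap unsent records buffered in memory at 32 MB
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000")         // fail send() after 60s if the buffer stays full
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000") // stop retrying a record after 2 minutes

val producer = new KafkaProducer[String, String](props)
producer.send(
  new ProducerRecord[String, String]("orders", "key", "value"), // "orders" is a made-up topic
  new Callback {
    override def onCompletion(md: RecordMetadata, e: Exception): Unit =
      if (e != null) System.err.println(s"Delivery failed, trigger fallback handling: ${e.getMessage}")
  })
producer.close()
```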

Read the whole thing.


Recent Apache NiFi Updates

Pierre Villard has some news for us around Apache NiFi:

Cloudera released a lot of things around Apache NiFi recently! We just released Cloudera Flow Management (CFM) 2.1.1, which provides Apache NiFi on top of Cloudera Data Platform (CDP) 7.1.6. This major release provides the latest and greatest of Apache NiFi, as it includes Apache NiFi 1.13.2 along with additional improvements, bug fixes, components, etc. Cloudera also released CDP 7.2.9 on all three major cloud platforms, which brings Flow Management on DataHub with Apache NiFi 1.13.2 and more. Let’s have a look at the main highlights of these releases.

Click through to see what’s included.


HDFS Data Encryption at Rest

Arun Kumar Natva takes us through the process of encrypting data at rest in Cloudera Data Platform:

HDFS Encryption delivers transparent end-to-end encryption of data at rest and is an integral part of HDFS. End-to-end encryption means that the data is encrypted and decrypted only by the client. In other words, data remains encrypted until it reaches the HDFS client.

Each HDFS file is encrypted using an encryption key. To prevent the management of these keys (which can run in the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. To add another layer of security, the file encryption key is stored in encrypted form, using another “encryption zone key”.
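As a rough sketch of how those pieces fit together on the command line (the key name and path here are invented, and a KMS such as Ranger KMS must already be configured):

```
# Create an encryption zone key in the KMS
hadoop key create reports-key

# Make an (empty) directory an encryption zone tied to that key
hdfs dfs -mkdir /secure/reports
hdfs crypto -createZone -keyName reports-key -path /secure/reports

# Confirm the zone exists
hdfs crypto -listZones
```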

Read on to learn more and to see how it all fits together.


reduceByKey and aggregateByKey in Spark

The Hadoop in Real World team compares two functions against RDDs in Spark:

Let’s examine the aggregateByKey call below. The first parameter, 0, is the initial value; it also indicates the type of the output.

The first _+_ function is the combine function applied on the map side, and the second _+_ function is the combine function applied on the reduce side. Both functions are the same in this case.
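The post’s code isn’t reproduced above, but a minimal sketch of the same idea, assuming sc is an existing SparkContext (as in spark-shell) and using a made-up pairs RDD, looks like this:

```scala
val sales = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))

// reduceByKey: a single function serves as both the map-side and reduce-side combiner
val totalsReduce = sales.reduceByKey(_ + _)

// aggregateByKey: the zero value (0) sets the output type; the first _+_ combines
// values within a partition (map side), the second _+_ merges partial results
// across partitions (reduce side)
val totalsAggregate = sales.aggregateByKey(0)(_ + _, _ + _)

totalsAggregate.collect().foreach(println) // e.g. (banana,1), (apple,5)
```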

This is a demo-driven post, so check it out.


Apache Kafka 2.8 Released

John Roesler announces Apache Kafka 2.8:

We are excited to announce that 2.8 introduces an early-access look at Kafka without ZooKeeper! The implementation is not yet feature complete and should not be used in production, but it is possible to start new clusters without ZooKeeper and go through basic produce and consume use cases.

At a high level, KIP-500 works by moving topic metadata and configurations out of ZooKeeper and into a new internal topic named @metadata. This topic is managed by an internal Raft quorum of “controllers” and is replicated to all brokers in the cluster. The leader of the Raft quorum serves the same role as the controller in clusters today. A node in the KIP-500 world can serve as a controller, a broker, or both, depending on the new process.roles configuration. See the README for quickstart instructions and additional details.
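As a hedged illustration of that configuration (the node ID, ports, and paths below are placeholders; the README is the authority here), a single node serving as both broker and controller might look like:

```
# server.properties sketch for a combined broker + controller node
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
log.dirs=/tmp/kraft-combined-logs
```

Before first start, the storage directory must be formatted with a cluster ID, e.g. bin/kafka-storage.sh format -t <cluster-id> -c config/kraft/server.properties (kafka-storage.sh random-uuid generates one), after which bin/kafka-server-start.sh works as usual.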

In addition to the headline item, there are plenty of other bugfixes and additions as well.


Querying Serverless SQL Pools from Spark Notebooks in Scala

Jovan Popovic shows off one integration point between the data services in Azure Synapse Analytics:

Azure Synapse Analytics provides multiple query runtimes that you can use to query in-database or external data. You have the choice to use T-SQL queries using a serverless Synapse SQL pool or notebooks in Apache Spark for Synapse analytics to analyze your data.

You can also connect these runtimes and run the queries from Spark notebooks on a dedicated SQL pool.

In this post, you will see how to create Scala code in a Spark notebook that executes a T-SQL query on a serverless SQL pool.
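As a rough sketch of the idea (the workspace endpoint, database, table, and credentials below are all placeholders, and the post itself covers the supported authentication options), Spark’s built-in JDBC reader can point at the serverless SQL endpoint:

```scala
// Run a T-SQL query on the serverless SQL pool and get the result as a DataFrame
val df = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://myworkspace-ondemand.sql.azuresynapse.net:1433;database=demo")
  .option("query", "SELECT TOP 10 * FROM dbo.usPopulation") // hypothetical table
  .option("user", "sqladminuser")                           // placeholder credentials
  .option("password", sys.env.getOrElse("SQL_PASSWORD", ""))
  .load()

df.show()
```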

Read on to see how. You can also query Spark pool and dedicated SQL pool tables from serverless SQL pools.


Removing a Node from a Hadoop Cluster

The Hadoop in Real World team shows us the proper way to remove a node from a Hadoop cluster:

This post will list out the steps to properly remove a node from a Hadoop cluster. It is not advisable to just shut down the node abruptly.

Node exclusions should be properly recorded in a file that is referred to by the property dfs.hosts.exclude. This property doesn’t have a default value, so in the absence of a file location and a file, the Hadoop cluster will not exclude any nodes.
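To make the mechanics concrete (the file path and hostname below are illustrative), dfs.hosts.exclude is pointed at an excludes file in hdfs-site.xml:

```
<!-- hdfs-site.xml: point the NameNode at an excludes file -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

With that in place, adding the node’s hostname to the excludes file and running hdfs dfsadmin -refreshNodes kicks off decommissioning; hdfs dfsadmin -report then shows the node move from Decommission in progress to Decommissioned, at which point it is safe to shut down.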

Read on for more information, including what happens if you simply turn off the node.


A Modern C++ Kafka API

Kenneth Jia and Benedek Thaler announce an open source library:

Morgan Stanley uses Apache Kafka® to publish market data to internal clients and to persist it for replay purposes. We started out using librdkafka’s C++ API, which maintains C++98 compatibility. C++ is evolving quickly, and we wanted to break away from this compatibility requirement so we could take advantage of new C++ features. This led us to create a new C++ API for Kafka that uses modern C++ features (i.e. C++14 and later). We’ve open sourced this client and hope you enjoy it.

Click through to learn more. What interests me about this is that most other languages’ Kafka support (for example, .NET) is based on librdkafka. I don’t know whether there’s any benefit to moving to this new library.


Geospatial Fraud Detection

Antoine Amend uses Databricks to identify financial fraud in a geographical area:

As part of this real-world solution, we are releasing a new open source geospatial library, GEOSCAN, to detect geospatial behaviors at massive scale, track customer patterns over time, and detect anomalous card transactions. Finally, we demonstrate how organizations can surface anomalies from an analytics environment to an online data store (ODS) with tight SLA requirements, following a Lambda-like infrastructure underpinned by Delta Lake, Apache Spark and MLflow.

Click through for the article, as well as three notebooks.
