
Category: Hadoop

Window Functions in Spark SQL

Juoko Virtanen walks us through window functions in Spark SQL:

When you think of windows in Spark you might think of Spark Streaming, but windows can be used on regular DataFrames. Window functions calculate an output value for every row of a DataFrame based on a group of rows. I have been working on optimizing some Spark code and have noticed a few places where the use of a window function eliminates the need for a join and speeds up the code. A common pattern where a window can be used to replace a join is when an aggregation is performed on a DataFrame and then the DataFrame resulting from the aggregation is joined to the original DataFrame. Let’s take a look at an example.

Read on for a few examples using the Scala flavor of Spark SQL.
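
To make the pattern concrete, here is a minimal Scala sketch (the DataFrame and column names are invented for illustration) of the aggregate-then-join pattern and the window function that replaces it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

object WindowVsJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WindowVsJoin").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("north", "2020-06-01", 100.0),
      ("north", "2020-06-02", 250.0),
      ("south", "2020-06-01", 175.0)
    ).toDF("region", "sale_date", "amount")

    // Join-based version: aggregate, then join the result back to the original DataFrame.
    val maxPerRegion = sales.groupBy("region").agg(max("amount").as("max_amount"))
    val viaJoin = sales.join(maxPerRegion, Seq("region"))

    // Window-based version: the same per-row result without the extra join.
    val byRegion = Window.partitionBy("region")
    val viaWindow = sales.withColumn("max_amount", max("amount").over(byRegion))

    viaJoin.show()
    viaWindow.show()
    spark.stop()
  }
}
```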

The Architecture of Apache Kafka

Michael Carter walks us through the key components and concepts behind Apache Kafka:

Despite its name’s suggestion of Kafkaesque complexity, Apache Kafka’s architecture actually delivers an easier-to-understand approach to application messaging than many of the alternatives. Kafka is essentially a commit log with a very simplistic data structure. It just happens to be an exceptionally fault-tolerant and horizontally scalable one.

The Kafka commit log provides a persistent ordered data structure. Records cannot be directly deleted or modified, only appended onto the log. The order of items in Kafka logs is guaranteed. The Kafka cluster creates and updates a partitioned commit log for each topic that exists. All messages sent to the same partition are stored in the order that they arrive. Because of this, the sequence of the records within this commit log structure is ordered and immutable. Kafka also assigns each record a unique sequential ID known as an “offset,” which is used to retrieve data.

Read on to learn more about the key ideas, such as producers, consumers, and partitions.
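
As a small illustration of those ideas, here is a hedged Scala sketch (topic name and broker address are made up) in which a producer prints the partition and offset that Kafka assigns to each appended record:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object OffsetDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")               // illustrative broker address
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    try {
      // Records with the same key go to the same partition, so they are appended
      // to that partition's log in order and receive increasing offsets.
      (1 to 3).foreach { i =>
        val record = new ProducerRecord[String, String]("demo-topic", "key-a", s"message $i")
        val metadata = producer.send(record).get()                 // block until acknowledged
        println(s"partition=${metadata.partition()} offset=${metadata.offset()}")
      }
    } finally {
      producer.close()
    }
  }
}
```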

Using Apache Flink in Zeppelin Notebooks

Jeff Zhang walks us through reviewing data streamed through Apache Flink in an Apache Zeppelin notebook:

In this post, we explained how the redesigned Flink interpreter works in Zeppelin 0.9.0 and provided some examples for performing streaming ETL jobs with Flink and Zeppelin. In the next post, I will talk about how to do streaming data visualization via Flink on Zeppelin. Besides that, you can find an additional tutorial for batch processing with Flink on Zeppelin, as well as for using Flink on Zeppelin for more advanced operations like resource isolation, job concurrency & parallelism, multiple Hadoop & Hive environments and more, in our series of posts on Medium. And here’s a list of Flink on Zeppelin tutorial videos for your reference.

Click through for the demo, and stay tuned for part 2.

Smoothing Out Write Behavior in Apache Flink

Dmitry Tolpeko solves an interesting problem:

It would be nice to smooth S3 write operations between two checkpoints. How to do that?

You may have already noticed there are 3 single PUT operations above, made at 37:02, 37:06 and 37:09, before the checkpoint. The write size gives you a clue: each is a single part of a multi-part upload to S3.

Some data sets were quite large, so their data spilled before the checkpoint. Note that this is an internal spill to S3: the data will not be visible until it is committed upon the successful Flink checkpoint.

So how can we force more writes to happen before the checkpoint so we can smooth IOPS and probably reduce the overall checkpoint latency? 

Read on for the answer.

Open-sourcing Kube2Hadoop

Cong Gu, et al, announce the open-sourcing of a project:

By default, there is a gap between the security model of Kubernetes and Hadoop. Specifically, Hadoop uses Kerberos, a three-party protocol built on symmetric key cryptography, to ensure any clients accessing the cluster are who they claim to be. In order to avoid frequent authentication checks against a Kerberos server, Delegation Tokens, a lightweight two-party authentication method, were introduced to complement Kerberos authentication. The Hadoop delegation token by default has a lifespan of one day and can be renewed for up to seven days. Kubernetes, on the other hand, uses a certificate-based approach for authentication, and does not expose the owner of a job in any of its public-facing APIs. Therefore, it is not possible to securely determine the authorized user from within the pod using the native Kubernetes API and then use that username to fetch the Hadoop delegation token for HDFS access.

To allow for Kubernetes workloads to securely access HDFS, we built Kube2Hadoop, a scalable and secure integration with HDFS Kerberos. This enables AI modelers at LinkedIn to use HDFS data in Kubernetes pods with access control through a user account or a headless account. Headless accounts are oftentimes used to denote a virtual team that is working on projects that would share the same data within the team. The data acquired can then be used in their model exploration and training with KubeFlow components such as the tf-operator and mpi-operator. In this blog, we will describe the design and authentication model of Kube2Hadoop. 
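
For a sense of what Kube2Hadoop fetches on a pod's behalf, here is a minimal sketch of requesting an HDFS delegation token through the standard Hadoop client API; the renewer name is a placeholder, and this is not LinkedIn's implementation:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

object DelegationTokenSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()               // picks up core-site.xml / hdfs-site.xml
    UserGroupInformation.setConfiguration(conf)  // needed when the cluster uses Kerberos
    val fs = FileSystem.get(conf)

    // Ask the NameNode to issue a delegation token; "yarn" is a placeholder renewer.
    // On a cluster without security enabled this returns null.
    val token = fs.getDelegationToken("yarn")
    if (token != null) {
      println(s"kind=${token.getKind}, service=${token.getService}")
    }
  }
}
```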

Read on to see how it works and a link to the GitHub repo.

Troubleshooting Kafka Remote Connections

Robin Moffatt explains common errors people run into when trying to connect to remote Kafka clusters:

In this example, my client is running on my laptop, connecting to Kafka running on another machine on my LAN called asgard03.

The initial connection succeeds. But note that the BrokerMetadata we get back shows that there is one broker, with a hostname of localhost. That means that our client is going to be using localhost to try to connect to a broker when producing and consuming messages. That’s bad news, because on our client machine, there is no Kafka broker at localhost (or if there happened to be, some really weird things would probably happen).

As usual, things boil down to “Configure it correctly and it works.”
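
For reference, the broker-side settings that usually cause this are the listener configurations; a hedged sketch of server.properties (listener names and ports are illustrative) looks something like the following. The advertised listener is what the broker hands back in its metadata, so it must be a hostname the client can actually resolve and reach:

```properties
# What the broker binds to locally
listeners=PLAINTEXT://0.0.0.0:9092
# What the broker tells clients to connect to after the initial bootstrap;
# this must be reachable from the client machine (here, the LAN hostname from the post)
advertised.listeners=PLAINTEXT://asgard03:9092
```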

Vectorized R I/O in Apache Spark 3.0

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that, using the R APIs from Apache Arrow 0.15.1, vectorization is now available in the upcoming Apache Spark 3.0, with substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.

Azure Active Directory and the DatabricksPS Library

Gerhard Brueckl has updated the DatabricksPS library:

Databricks recently announced that it is now also supporting Azure Active Directory Authentication for the REST API which is now in public preview. This may not sound super exciting but is actually a very important feature when it comes to Continuous Integration/Continuous Delivery pipelines in Azure DevOps or any other CI/CD tool. Previously, whenever you wanted to deploy content to a new Databricks workspace, you first needed to manually create a user-bound API access token. As you can imagine, manual steps are also bad for otherwise automated processes like a CI/CD pipeline. With Databricks REST API finally supporting Azure Active Directory Authentication of regular users and service principals, this last manual step is finally also gone!

If you do use Databricks and haven’t tried out DatabricksPS, I highly recommend it. I think it’s a much nicer experience than hitting the REST API directly, particularly because it deals with continuation tokens and making multiple calls to get your results.

Kafka + Kotlin

Unni Mana shows how to create a Kafka consumer and producer in the Kotlin language:

We are using KafkaTemplate to send the message to a topic called test_topic. This will return a ListenableFuture object from which we can get the result of this action. This approach is the easiest one if you just want to send a message to a topic.

Generally, when we talk about the Hadoop ecosystem and functional programming languages on the Java Virtual Machine, we think Scala. But this is an example showing that Kotlin is in that discussion too.
