Press "Enter" to skip to content

Category: Hadoop

Static versus Dynamic Partitioning in Hive

The Hadoop in Real World team explains the difference between two partitioning strategies:

The difference between static and dynamic partitioning exists only while the partitions are being created, in how the partitions are added to the table. Once the partitions are created, the tables show no difference between static and dynamic partitions; all partitions are treated as one and the same.

Click through for the difference.
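To make the distinction concrete, here is a minimal sketch using PySpark's Hive support; the table and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

# Hive support lets us run HiveQL from Spark; names below are illustrative.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
""")

# Static partitioning: we name the partition value ourselves in the INSERT.
spark.sql("""
    INSERT INTO sales PARTITION (sale_date = '2021-12-01')
    SELECT id, amount FROM staging_sales WHERE sale_date = '2021-12-01'
""")

# Dynamic partitioning: Hive derives partition values from the query output.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO sales PARTITION (sale_date)
    SELECT id, amount, sale_date FROM staging_sales
""")
```

Either way, the partitions that come out the other side are indistinguishable, which is the team's point.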

Configuring a Debezium Connector for Event Hub Streaming

Niels Berglund continues a series:

This series came about because, in the post How to Use Kafka Client with Azure Event Hubs, I somewhat foolishly said:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

I wrote the above without testing it myself, so when I was called out on it, I started researching (read “Googling”) to see if it was possible. The “Googling” didn’t give a 100% certain answer, so I decided to try it out, and this series is the result.

In the first post – as mentioned – we configured Kafka Connect to connect to Event Hubs. In this post, we look at configuring the Debezium connector.

Click through and enjoy the fruits of Berglund’s Folly. As far as follies go, I’d still rate Seward’s Folly higher, but this one’s pretty good too.
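If you want a feel for what registering such a connector looks like, here is a sketch that posts a Debezium SQL Server connector configuration to Kafka Connect's REST API. The host names, credentials, and database details are placeholders, and the exact property list may differ from Berglund's setup:

```python
import json
import requests

# All values below are placeholders; adjust them to your own environment.
connector = {
    "name": "sqlserver-connector",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "sqlserver.example.com",
        "database.port": "1433",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "inventory",
        "database.server.name": "server1",
        # Debezium keeps its schema history in a Kafka topic; with Kafka
        # Connect pointed at Event Hubs, that topic lives in Event Hubs too.
        "database.history.kafka.bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
        "database.history.kafka.topic": "schema-history.inventory",
    },
}

# Kafka Connect's REST API listens on port 8083 by default.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```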

MLOps on Databricks

Piotr Majer and Michael Shtelma complete a series on MLOps on Databricks:

This is the second part of a two-part series of blog posts that show an end-to-end MLOps framework on Databricks based on notebooks. In the first post, we presented a complete CI/CD framework on Databricks with notebooks. The approach is based on the Azure DevOps ecosystem for the Continuous Integration (CI) part and the Repos API for Continuous Delivery (CD). This post extends the presented CI/CD framework with machine learning, providing a complete MLOps solution.

Check it out.
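The Repos API half of that pipeline is small enough to sketch. Something like the following, with the workspace URL, repo ID, and token as placeholders, is what the CD step boils down to: pull the latest code into a workspace repo.

```python
import requests

# Placeholders: your workspace URL, a repo ID from the Repos API, and a
# personal access token (in a real pipeline, injected as a secret).
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
repo_id = "123"
token = "<personal-access-token>"

# PATCH /api/2.0/repos/{id} checks the workspace repo out to the given
# branch and pulls the latest commit -- the "deploy" in this framework.
resp = requests.patch(
    f"{workspace_url}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},
)
resp.raise_for_status()
```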

Streaming Data to Event Hubs via Kafka Connect and Debezium

Niels Berglund starts off a two-part sub-series within a series:

This post is the first of two looking at whether and how we can stream data to Event Hubs from Debezium. Initially, I had planned only one post covering this, but it turned out that the post would be too long, so I split it in two.

It started with the post How to Use Kafka Client with Azure Event Hubs. In that post, I looked at how the Kafka client can publish messages not only to Apache Kafka but also to Azure Event Hubs. In the post, I said something like:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

Click through for the first part of this pairing.
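The claim itself is easy to test with a plain Kafka producer. Here is a minimal sketch against Event Hubs' Kafka endpoint; the namespace, connection string, and topic name are placeholders:

```python
from confluent_kafka import Producer

# Event Hubs exposes a Kafka-compatible endpoint on port 9093. Authentication
# is SASL/PLAIN with the literal username "$ConnectionString" and the
# namespace connection string as the password.
producer = Producer({
    "bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
})

# An Event Hub in the namespace shows up as a Kafka topic.
producer.produce("test-topic", key="k1", value="hello from a Kafka client")
producer.flush()
```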

Build a Sandbox for Testing PolyBase and Hadoop

Fernando Sibaja Araya has a step-by-step guide to building a Hadoop sandbox for testing PolyBase on SQL Server:

This guide will take you step by step through deploying a Hadoop sandbox into Azure. You will then connect to the sandbox through SSH and tunnel all the required ports to your machine, so you can access all the endpoints needed to execute Hadoop queries from PolyBase.

We will be deploying Hortonworks Data Platform (HDP) Sandbox 2.6.4. This will be one VM running in Azure; within this VM, a Docker container will run all the HDP services.

Click through for the full set of instructions. I’m a little overjoyed that my blog snuck into the set of links and resources at the end.
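If you'd rather script the tunneling step than type it out by hand, something like this works; the host name is a placeholder and the port list is illustrative (the guide enumerates the exact endpoints PolyBase needs):

```python
import subprocess

# Illustrative ports only: NameNode RPC, WebHDFS, DataNode, ResourceManager.
ports = [8020, 50070, 50010, 8050]

# Build one "-L local:remote" forward per port for a single ssh invocation.
forwards = []
for p in ports:
    forwards += ["-L", f"{p}:localhost:{p}"]

# The HDP sandbox's SSH endpoint traditionally listens on port 2222.
subprocess.run(["ssh", *forwards, "-p", "2222", "root@my-sandbox.cloudapp.azure.com"])
```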

Apache Flink ML 2.0.0

Dong Lin and Yun Gao make an announcement:

The Apache Flink community is excited to announce the release of Flink ML 2.0.0! Flink ML is a library that provides APIs and infrastructure for building stream-batch unified machine learning algorithms that are easy to use and performant, with (near-)real-time latency.

This release involves a major refactor of the earlier Flink ML library and introduces major features that extend the Flink ML API and the iteration runtime, such as supporting stages with multi-input multi-output, graph-based stage composition, and a new stream-batch unified iteration library. Moreover, we added five algorithm implementations in this release, which is the start of a long-term initiative to provide a large number of off-the-shelf algorithms in Flink ML with state-of-the-art performance.

Congratulations to everybody who contributed to the project; it’s a big milestone.
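If the stage terminology is unfamiliar, the shape of the API is roughly the classic Estimator/Transformer pattern, extended so stages can take and produce multiple tables. Here is a generic Python illustration of that pattern; it is not Flink ML's actual API, which is Java:

```python
# Generic sketch of the Estimator/Transformer "stage" pattern -- NOT the
# real Flink ML API. Stages may take and emit multiple tables.
class Transformer:
    """A fitted stage: maps input table(s) to output table(s)."""
    def transform(self, *tables):
        raise NotImplementedError

class Estimator:
    """A trainable stage: fit() on input table(s) yields a Transformer."""
    def fit(self, *tables):
        raise NotImplementedError

class PipelineModel(Transformer):
    """Chains fitted stages, feeding each one the previous stage's output."""
    def __init__(self, models):
        self.models = models

    def transform(self, *tables):
        for model in self.models:
            tables = model.transform(*tables)
        return tables

class Pipeline(Estimator):
    """Fits each stage in turn on the running output of the stages before it."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, *tables):
        models = []
        for stage in self.stages:
            model = stage.fit(*tables) if isinstance(stage, Estimator) else stage
            models.append(model)
            tables = model.transform(*tables)
        return PipelineModel(models)
```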

Combining Azure DevOps and Databricks

Anna Wykes continues a series on DevOps for Databricks:

An Environment Variable is a variable stored outside of the Python script; in our instance, it will be stored on the DevOps Agent running the DevOps Pipelines. Consequently, it is accessible to other scripts/programs running on the DevOps Agent. We will not cover DevOps Agents in this blog specifically; the simplest description is that they are the compute that runs your pipeline, normally a VM (Virtual Machine) or Docker container.

Read the whole thing.
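The Python side of this is a one-liner. A small sketch, assuming the pipeline has set a variable named DATABRICKS_HOST on the agent (the name is illustrative):

```python
import os

# The pipeline (or your shell, when testing locally) sets this on the agent;
# the script only reads it. The variable name is illustrative.
databricks_host = os.environ.get("DATABRICKS_HOST")
if databricks_host is None:
    raise RuntimeError("DATABRICKS_HOST is not set on this agent")

print(f"Deploying to {databricks_host}")
```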

When Not to Use Apache Kafka

Kai Waehner looks at when we may (or may not) want to use Apache Kafka:

Apache Kafka is the de facto standard for event streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: when should you NOT use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How do you qualify Kafka out as not the right tool for the job? This blog post explores the DOs and DON'Ts. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.

I appreciate this kind of post a lot, especially from someone directly invested in the product. No technology can or should fit all purposes, and the better you can explain where something does not fit, the better you can explain where it does fit.

Improving Apache Flink Scheduler Performance

Zhilong Hong, et al., share some interesting results out of Apache Flink 1.14. Part one sets the scene:

To estimate the effect of our optimizations, we conducted several experiments comparing the performance of Flink 1.12 (before the optimization) with Flink 1.14 (after the optimization). The job in our experiments contains two vertices connected with an all-to-all edge. The parallelisms of these vertices are both 10K. To make temporary deployment descriptors distributed via the blob server, we set the configuration blob.offload.minsize to 100 KiB (down from the default value of 1 MiB). This configuration means that blobs larger than the set value will be distributed via the blob server, and the size of the deployment descriptors in our test job is about 270 KiB. The results of our experiments are illustrated below:

Part two explains their improvements:

In Flink 1.12, the ExecutionEdge class is used to store the information of connections between tasks. This means that for the all-to-all distribution pattern, there would be O(n²) ExecutionEdges, which would take up a lot of memory for large-scale jobs. For two JobVertices connected with an all-to-all edge and a parallelism of 10K, it would take more than 4 GiB of memory to store 100M ExecutionEdges. Since there can be multiple all-to-all connections between vertices in production jobs, the amount of memory required would increase rapidly.

As we can see in Fig. 1, for two JobVertices connected with the all-to-all distribution pattern, all IntermediateResultPartitions produced by upstream ExecutionVertices are isomorphic, which means that the downstream ExecutionVertices they connect to are exactly the same. The downstream ExecutionVertices belonging to the same JobVertex are also isomorphic, as the upstream IntermediateResultPartitions they connect to are the same too. Since every JobEdge has exactly one distribution type, we can divide vertices and result partitions into groups according to the distribution type of the JobEdge.

Click through for a dive into the architecture.
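The scale of the win is easy to see with back-of-the-envelope arithmetic. This sketch just contrasts the two representations using the post's figures; it is not Flink's actual data structures:

```python
# Figures from the post: parallelism 10K on both sides of an all-to-all edge.
parallelism = 10_000

# Flink 1.12 style: one ExecutionEdge object per (upstream, downstream) pair.
edge_count = parallelism * parallelism            # 100,000,000 edges
bytes_per_edge = 43                               # implied by "more than 4 GiB"
print(f"~{edge_count * bytes_per_edge / 2**30:.1f} GiB")   # ~4.0 GiB

# Flink 1.14 style: vertices on each side are isomorphic, so one group
# descriptor stands in for the individual edges -- O(n) state, not O(n^2).
upstream = list(range(parallelism))
downstream = list(range(parallelism))
connection = ("ALL_TO_ALL", upstream, downstream)
```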
