Hadoop – Page 31 – Curated SQL

You can use Azure Synapse and Azure Databricks to prepare and modify your Delta Lake data sets placed in the Azure Data Lake storage. Once your data engineers have prepared the data, your data analysts can create reports using the tools such as Power BI.
Using the serverless query endpoint in Azure Synapse, you can create a relational layer on top of your Delta Lake files that directly references the location where Azure Synapse and Azure Databricks are used to modify data. This way, you can get the real-time analytics on top of the Delta Lake data set without any need to wait for a pipeline to copy and prepare data.

Read on to see how this works.

Comments closed

Securing Databricks on AWS

Published 2021-05-27 by Kevin Feasel

Andrew Weaver, et al, take us through security practices for running Databricks on AWS:

In this article, we will share a list of cloud security features and capabilities that an enterprise data team can use to harden their Databricks environment on AWS as per their risk profile and governance policy. For more information about how Databricks runs on Amazon Web Services (AWS), view the AWS web page and Databricks security on AWS page for more specific details on security and compliance.

Click through for that list.

Comments closed

Error Handling Patterns in Kafka

Published 2021-05-27 by Kevin Feasel

Gerardo Villeda gives a few options for handling errors in an Apache Kafka topic:

Apache Kafka^® applications run in a distributed manner across multiple containers or machines. And in the world of distributed systems, what can go wrong often goes wrong. This blog post covers different ways to handle errors and retries in your event streaming applications. The nature of your process determines the patterns, and more importantly, your business requirements.
This blog provides a quick guide on some of those patterns and expands on a common and specific use case where events need to be retried following their original order. This blog post illustrates a scenario of an application that consumes events from one topic, transforms those events, and produces an output to a target topic, covering different approaches as they gradually increase in complexity.

Click through for the list. Each explanation is pretty short, but opens the door for further analysis.

Comments closed

Azure Synapse Analytics Supports Apache Spark 3.0

Published 2021-05-26 by Kevin Feasel

Euan Garden has some great news for us:

Starting today, the Apache Spark 3.0 runtime is now available in Azure Synapse. This version builds on top of existing open source and Microsoft specific enhancements to include additional unique improvements listed below. The combination of these enhancements results in a significantly faster processing capability than the open-source Spark 3.0.2 and 2.4.
The public preview announced today starts with the foundation based on the open-source Apache Spark 3.0 branch with subsequent updates leading up to a Generally Available version derived from the latest 3.1 branch.

It still won’t be as fast as Databricks, but it should be a good bit faster than the Spark 2 they were running.

Comments closed

Broadcast Variables in Apache Spark

Published 2021-05-25 by Kevin Feasel

The Hadoop in Real World team explains the notion of broadcast variables in Apache Spark:

Broadcast variables are variables which are available in all executors executing the Spark application. These variables are already cached and ready to be used by tasks executing as part of the application. Broadcast variables are sent to the executors only once and it is available for all tasks executing in the executors.

Read on to understand when they are useful and, just as importantly, when not to use them. They seem like the type of thing which a newer developer could easily misuse.

Comments closed

Understanding Consumer Lag in Apache Kafka

Published 2021-05-25 by Kevin Feasel

Loretta Jones takes us through the notion of consumer lag in an Apache Kafka topic:

Amongst various metrics that Kafka monitoring includes consumer lag is nearly the most important of them all. In this post, we will explore potential reasons for Kafka consumer lag and what you could do when you experience lag.

This post is fairly high-level, and it does a good job of explaining the notion to someone without much familiarity with Kafka.

Comments closed

Data Pipeline Error Handling with Apache NiFi

Published 2021-05-24 by Kevin Feasel

Pieter Humphrey gives us a few techniques for handling data pipeline errors when running Apache NiFi:

The more complex the model, the more possible sources of problems exist. Forecasting every single potential problem is, of course, impossible. Identifying the most important ones and providing self-solving solutions can greatly reduce the operational uncertainty of our NiFi pipeline and improve its robustness.
To see how to do this analysis, we will consider four possible strategies: one external and three internal. They certainly do not cover all potential error scenarios, they are just examples that we can extrapolate from, and inform how to handle other potential failure domains.

Click through for an overview of the topic as well as those four techniques.

Comments closed

Finding Duplicates in a Spark DataFrame

Published 2021-05-21 by Kevin Feasel

The Hadoop in Real World team shows how to deduplicate rows in a DataFrame in Spark:

It is a pretty common use case to find the list of duplicate elements or rows in a Spark DataFrame and it is very easy to do with a groupBy() and a count()

Where “easy” has as a modifier just how many columns you’re dealing with in the DataFrame.

Comments closed

Explode and PosExplode in Hive

Published 2021-05-17 by Kevin Feasel

The Hadoop in Real World team talks about two of my favorite function names in Hive:

Both explode and posexplode are User Defined Table generating Functions. UDTFs operate on single rows and produce multiple rows as output.

Click through to learn what each of them does.

Comments closed

Learning the Basics of Kafka via Notebook

Published 2021-05-14 by Kevin Feasel

Francesco Tisiot shares a way to learn about the basics of Apache Kafka using Jupyter notebooks:

One of the best ways to learn a new technology is to try it within an assisted environment that anybody can replicate and get working within few minutes. Notebooks represent an excellence in this field by allowing people to share and use pre-built content which includes written descriptions, media and executable code in a single page.
This blog post aims to teach you the basics of Apache Kafka Producers and Consumers through building an interactive notebook in Python. If you want to browse a full ready-made solution instead, check out our dedicated github repository.

The classic tutorials tend to use a couple command prompts and the built-in producer and consumer shell scripts. I like this approach as a way of being able to review the code and results later as a refresher.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop

Querying Delta Lake Files with T-SQL in Azure Synapse Analytics

Securing Databricks on AWS

Error Handling Patterns in Kafka

Azure Synapse Analytics Supports Apache Spark 3.0

Broadcast Variables in Apache Spark

Understanding Consumer Lag in Apache Kafka

Data Pipeline Error Handling with Apache NiFi

Finding Duplicates in a Spark DataFrame

Explode and PosExplode in Hive

Learning the Basics of Kafka via Notebook