Press "Enter" to skip to content

Category: Hadoop

Handling Bad Records with Apache Spark

Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:

Most of the time, writing ETL jobs becomes very expensive when it comes to handling corrupt records, and in such cases ETL pipelines need a good solution for handling them: the larger the ETL pipeline is, the more complex it becomes to handle such bad records along the way. Corrupt data includes:

– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.
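To make this concrete, here is a minimal PySpark sketch of one common technique for catching corrupt records when reading CSV data; the file path, schema, and column names are placeholders rather than anything from the linked article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("bad-records-demo").getOrCreate()

# Include a column that Spark can use to capture rows it cannot parse
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

# PERMISSIVE keeps bad rows and routes them to _corrupt_record;
# DROPMALFORMED silently discards them; FAILFAST aborts on the first bad row.
df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/data/input.csv")
      .cache())  # cache before filtering on the corrupt-record column

bad_rows = df.filter(df["_corrupt_record"].isNotNull())
good_rows = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
```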

This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.


Tips for Moving from Pandas to Koalas

Haejoon Lee, et al., walk us through migrating existing code written for pandas to use the Koalas library:

In particular, two types of users benefit the most from Koalas:

– pandas users who want to scale out using PySpark and potentially migrate their codebase to PySpark. Koalas is scalable and makes learning PySpark much easier.
– Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don’t have to build these functions themselves in PySpark

This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss best practices for using Koalas: when to use Koalas as a drop-in replacement for pandas, how to drop down to PySpark when a pandas API is not available in Koalas, and when to apply Koalas-specific APIs to improve productivity. The example notebook in this blog can be found here.
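As a quick illustration of the drop-in nature of the API, here is a hedged before-and-after sketch; the file path and column names are invented for the example and are not from the post.

```python
import pandas as pd
import databricks.koalas as ks

# pandas: runs on a single machine
pdf = pd.read_csv("/data/sales.csv")
pandas_result = pdf.groupby("region")["amount"].sum()

# Koalas: same API shape, but the work is distributed over Spark
kdf = ks.read_csv("/data/sales.csv")
koalas_result = kdf.groupby("region")["amount"].sum()

# When a pandas API is missing in Koalas, drop down to PySpark and come back
sdf = kdf.to_spark()    # Koalas DataFrame -> Spark DataFrame
kdf2 = sdf.to_koalas()  # Spark DataFrame -> Koalas DataFrame
```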

Read on to learn more.


Accessing Blob Storage from Azure Databricks

Gauri Mahajan shows how we can read data in Azure Blob Storage from Azure Databricks:

Since our base set-up, comprising Azure Blob Storage (with a .csv file) and Azure Databricks Service (with a Scala notebook), is in place, let’s talk about the structure of this article. We will demonstrate the following in this article:

1. We will first mount the Blob Storage in Azure Databricks using the Apache Spark Scala API. In simple words, we will read a CSV file from Blob Storage into Databricks.
2. We will do some quick transformations on the data and will move this processed data to a temporary SQL view in Azure Databricks. We will also see how we can use multiple languages in the same Databricks notebook.
3. Finally, we will write the transformed data back to the Azure Blob Storage container using the Scala API.

It’s just a few lines of code. One of the best things Microsoft and the Databricks team did for Azure Databricks was to ensure that it felt like a first-party offering—everything feels a little more integrated than Databricks for AWS.
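For a sense of what those few lines look like, here is a hedged Python sketch of the mount–read–write flow (the article itself uses the Scala API); the storage account, container, secret scope, and paths are all placeholders.

```python
# Inside a Databricks notebook, where spark and dbutils are predefined
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/blobdata",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-account-key")
    },
)

# Read the CSV from the mounted container
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/blobdata/input.csv"))

# Expose it to SQL cells, then write the transformed data back to the container
df.createOrReplaceTempView("staged_data")
df.write.mode("overwrite").csv("/mnt/blobdata/output")
```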


The Flink-Hive Integration

Bowen Li takes us through Apache Flink 1.10’s integration with Apache Hive:

On the other hand, Apache Hive has established itself as a focal point of the data warehousing ecosystem. It serves not only as a SQL engine for big data analytics and ETL, but also as a data management platform, where data is discovered and defined. As business evolves, it puts new requirements on the data warehouse.

Thus we started integrating Flink and Hive as a beta version in Flink 1.9. Over the past few months, we have been listening to users’ requests and feedback, extensively enhancing our product, and running rigorous benchmarks (which will be published soon separately). I’m glad to announce that the integration between Flink and Hive is at production grade in Flink 1.10 and we can’t wait to walk you through the details.

Click through to see how it works.


Data Exfiltration Protection when Using Azure Databricks

Bhavin Kukadia, et al., explain how to prevent users from taking data from your Databricks cluster without authorization:

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or it processes the data in the service provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.

Click through for the architectural pattern.


Schema Management for Spark Applications

Walaa Eldin Moustafa takes us through some of the things that LinkedIn has learned about schema management with Apache Spark:

At LinkedIn, the Hive Metastore is the source-of-truth catalog for all Hadoop data. The Hive Metastore is managed by Dali. Dali is a data access and processing platform that is integrated with compute engines and ETL pipelines at LinkedIn to ensure consistency and uniformity in the access and storage of data. Dali utilizes the Hive Metastore to store data formats, data locations, partition information, and table information. Among other features, Dali also manages the definition of SQL views, as well as storing and accessing those definitions from the Hive Metastore.
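To see the general idea outside of LinkedIn’s Dali layer, here is a plain Spark sketch that pulls a table’s schema and storage details from the Hive Metastore; the database and table names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-demo")
         .enableHiveSupport()  # use the Hive Metastore as Spark's catalog
         .getOrCreate())

# Schema, data format, location, and partition info all come from the metastore,
# not from the files themselves
df = spark.table("tracking.page_views")
df.printSchema()

spark.sql("DESCRIBE FORMATTED tracking.page_views").show(truncate=False)
spark.sql("SHOW PARTITIONS tracking.page_views").show(truncate=False)
```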

Read on for a good explanation of the how as well as the why.


Working with Spark.Net on Azure Synapse Analytics

Paul Andrew takes a look at Spark.NET (or Spark.Net or dotnet-spark or however I’m calling it this time):

The main reason I wanted access to Synapse is to play around with Spark.Net via the Synapse workspace Notebooks. Currently, if deploying Synapse via the public Azure portal, you only get the option to create a SQL compute pool, formerly known as an Azure SQL DW. While this is good, it gives us none of the exciting things that we were shown about Synapse back in November last year during the Microsoft Ignite conference.

To get the good stuff in Azure Synapse Analytics you need access to the full developer UI and Synapse Workspace.

Click through to learn more about the experience.


Using Azure Key Vault with Azure Databricks

Jason Bonello shows how easy it is to integrate Azure Key Vault into Azure Databricks:

In Azure Key Vault we will be adding secrets that we will be calling through Azure Databricks within the notebooks. First and foremost, this is for security purposes. It will ensure usernames and passwords are not hardcoded within the notebook cells and offer some type of control over access in case it needs to be reverted later on (assuming it is controlled by a different administrator). In addition to this, it will offer a better way of maintaining a solution, since if a password ever needs to be changed, it will only be changed in the Azure Key Vault without the need to go through any notebooks or logic.
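A minimal sketch of what this looks like from a notebook, assuming a Key Vault-backed secret scope and secret, server, and table names that are purely illustrative:

```python
# Secrets are resolved at run time from the Key Vault-backed scope;
# nothing is hardcoded in the notebook cells.
jdbc_user = dbutils.secrets.get(scope="keyvault-scope", key="sql-user")
jdbc_password = dbutils.secrets.get(scope="keyvault-scope", key="sql-password")

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
      .option("dbtable", "dbo.SomeTable")
      .option("user", jdbc_user)
      .option("password", jdbc_password)
      .load())
```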

Even if you don’t use Key Vault, Databricks does include its own secrets storage, so there’s really no reason to keep credentials in plaintext.


Authentication in Hadoop with Apache Ozone

Xiaoyu Yao explains how we can use Apache Ozone to perform service account authentication for a Hadoop cluster:

Like Hadoop delegation tokens, an Ozone security token has a token identifier along with a signature from the issuer. The Ozone Manager issues delegation tokens and block tokens for users or client applications authenticated with Kerberos. The signature of the token can be validated by token validators to verify the identity of the issuer. This way, a valid token holder can use the token to perform operations against the cluster services as if they held the Kerberos tickets of the issuer.

Read on for the high-level overview.
