Press "Enter" to skip to content

Category: Hadoop

Locking Mechanisms in Apache Hive

Shobika Selvaraj documents lock types in Apache Hive and what commands acquire which types:

There are two types of shared lock: shared_read and shared_write. A shared_read lock means that any other shared_read or shared_write query can run at the same time. A shared_write lock means that any other shared_read can be performed, but no other shared_write lock can be acquired at that time.

With an exclusive lock, no shared_read or shared_write can run at the same time.

There are three types of lock state:

   (a) Acquired – the transaction initiator holds the lock
   (b) Waiting – the transaction initiator is waiting for the lock
   (c) Aborted – the lock has timed out but has not yet been cleaned

I was a bit surprised about inserts being shared read but that’s not a typo in the table—Shobika brings receipts.
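
If you want to see these locks in action, here is a minimal sketch that connects to HiveServer2 with the pyhive library (my choice of client, not something from the post) and runs a couple of statements against a hypothetical ACID table called sales before checking SHOW LOCKS:

```python
# A minimal sketch, assuming a HiveServer2 endpoint at "hive-host" and a
# transactional (ACID) table named "sales"; both names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-host", port=10000)
cur = conn.cursor()

# A SELECT acquires a shared_read lock, so other shared_read and
# shared_write statements can run concurrently.
cur.execute("SELECT COUNT(*) FROM sales")

# An INSERT on an ACID table also acquires a shared_read lock -- the detail
# the post calls out.
cur.execute("INSERT INTO sales VALUES (1, 'widget', 9.99)")

# Inspect which locks are currently acquired or waiting.
cur.execute("SHOW LOCKS sales")
for row in cur.fetchall():
    print(row)
```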


Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al, announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…


Data Governance in Databricks with Unity Catalog

Paul Roome, et al, announce the upcoming GA for Databricks Unity Catalog:

Today we are excited to announce that Unity Catalog, a unified governance solution for all data assets on the Lakehouse, will be generally available on AWS and Azure in the upcoming weeks. Currently, you can apply for a public preview or reach out to a member of your Databricks account team.

In a previous blog, we set out our vision for a governed lakehouse and how Unity Catalog can help customers simplify governance at scale. This blog will explore the most recent updates to Unity Catalog and our growing partner ecosystem.

Click through for those updates and to sign up for the public preview if so inclined.


Customer Segmentation via Databricks Solution Accelerator

Gavita Regunath discovers customer segments in a dataset:

We will be using the German Credit dataset, a publicly available dataset provided by Dr. Hans Hofmann of the University of Hamburg. The German Credit dataset contains features describing 1000 loan applicants who have taken credit from the bank. Using this dataset, our aim will be to understand the following “How should the bank personalise its products for its customers?”.

Click through to see an example of clustering to generate customer segments.
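
For a feel of the general technique before clicking through, here is a minimal PySpark clustering sketch. It assumes the German Credit data is already loaded as a DataFrame named credit_df; the feature columns and the number of clusters are illustrative rather than taken from Gavita's notebook:

```python
# A minimal k-means segmentation sketch; `credit_df` and the column names
# ("duration", "credit_amount", "age") are assumptions for illustration.
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import StandardScaler, VectorAssembler

assembler = VectorAssembler(
    inputCols=["duration", "credit_amount", "age"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
kmeans = KMeans(featuresCol="features", k=4, seed=42)

model = Pipeline(stages=[assembler, scaler, kmeans]).fit(credit_df)

# Each applicant gets a "prediction" column holding its segment id.
segments = model.transform(credit_df)
segments.groupBy("prediction").count().show()
```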


Software Engineering Practices for Notebooks

Rafi Kurlansik and Austin Ford explain how to get the most out of notebooks, using Databricks as an example:

Notebooks are a popular way to start working with data quickly without configuring a complicated environment. Notebook authors can quickly go from interactive analysis to sharing a collaborative workflow, mixing explanatory text with code. Often, notebooks that begin as exploration evolve into production artifacts. For example,

1. A report that runs regularly based on newer data and evolving business logic.

2. An ETL pipeline that needs to run on a regular schedule, or continuously.

3. A machine learning model that must be re-trained when new data arrives.

Perhaps surprisingly, many Databricks customers find that with small adjustments, notebooks can be packaged into production assets, and integrated with best practices such as code review, testing, modularity, continuous integration, and versioned deployment.

Read on for several tips and recommendations.
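
A recurring recommendation along these lines is to pull notebook logic out into plain, importable functions so it can be reviewed and unit tested. A rough sketch of the idea, with hypothetical module, function, and fixture names:

```python
# transforms.py -- a pure transformation that a notebook can import; being an
# ordinary function makes it easy to code review and test.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_revenue(df: DataFrame) -> DataFrame:
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


# test_transforms.py -- run with pytest; assumes a `spark` session fixture
# is defined in conftest.py (or provided by a plugin such as pytest-spark).
def test_add_revenue(spark):
    df = spark.createDataFrame([(2, 5.0)], ["quantity", "unit_price"])
    assert add_revenue(df).collect()[0]["revenue"] == 10.0
```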


PHI De-Identification in Databricks with NLP

Amir Kermany, et al, share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.

To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.


Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow, for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks supported Git providers including GitHub, Gitlab, Bitbucket, Azure Devops and AWS CodeCommit.

Read on to see how it works.
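
To give a sense of what this looks like programmatically, here is a rough sketch of creating a job whose notebook task is sourced from a Git repository via the Jobs API 2.1; the workspace URL, token, repository, cluster ID, and notebook path are all placeholders:

```python
# A minimal sketch of Jobs API 2.1 `git_source`; every identifier below is a
# placeholder you would replace with your own values.
import requests

workspace = "https://<your-workspace>.cloud.databricks.com"
payload = {
    "name": "nightly-report",
    "git_source": {
        "git_url": "https://github.com/example-org/example-repo",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_report",
            # Path is relative to the repository root because git_source is set.
            "notebook_task": {"notebook_path": "notebooks/report"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    f"{workspace}/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())
```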


Improving Join Performance with Skewed Datasets in Spark

Ajay Gupta gets into the topic of join performance:

Performing Joins on Skewed Datasets: A Dataset is considered to be skewed for a Join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys – for example, when 80% of records in the dataset contribute to only 20% of the Join keys.

Implications of Skewed Datasets for Join: Skewed Datasets, if not handled appropriately, can lead to stragglers in the Join stage (Read this linked story to know more about Stragglers). This brings down the overall execution efficiency of the Spark job. Also, skewed datasets can cause memory overruns on certain executors leading to the failure of the Spark job. Therefore, it is important to identify and address Join-based stages where large skewed datasets are involved.

Read on for five techniques which may help you out.
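
As a taste of the general idea (Ajay's five techniques may differ in the details), here is a minimal PySpark sketch of two common mitigations: letting Spark 3.x adaptive query execution split skewed partitions, and manually salting a hot key. The DataFrames and join column are hypothetical:

```python
# A minimal sketch; `large_skewed_df`, `small_df`, and "join_key" are
# placeholders for your own DataFrames and column.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: let AQE detect and split skewed partitions during sort-merge joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Option 2: salt the skewed side so one hot key spreads across N partitions,
# and replicate the other side across the same N salt values.
N = 16
left = large_skewed_df.withColumn("salt", (F.rand() * N).cast("int"))
right = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)])))
joined = left.join(right, on=["join_key", "salt"]).drop("salt")
```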


Delta Live Tables and Power BI Data Modeling

Tahir Fayyaz goes from Delta Lake to Power BI:

To get the optimal performance from Power BI it is recommended to use a star schema data model and to make use of user-defined aggregated tables. However, as you build out your facts, dimensions, and aggregation tables and views in Delta Lake, ready to be used by the Power BI data model, it can become complicated to manage all the pipelines, dependencies, and data quality as you need to consider the following:

– How to easily develop and manage the data model’s transformation code.

– How to run and scale data pipelines for the model as data volumes grow.

– How to keep all the Delta Lake tables updated as new data arrives.

– How to view the lineage for all tables as the model gets more complex.

– How to actively stop data quality issues that result in incorrect reports.

Read on for recommendations, a couple architectural diagrams, and some sample code.
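
To give a flavor of the Delta Live Tables side of this, here is a minimal Python sketch of a pipeline that builds a dimension table and a pre-aggregated table with a data quality expectation; the table names, storage paths, and columns are mine rather than Tahir's:

```python
# A minimal Delta Live Tables sketch; source paths, table names, and columns
# are hypothetical. `spark` is provided by the DLT runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Customer dimension for the Power BI star schema")
@dlt.expect_or_drop("valid_customer_key", "customer_id IS NOT NULL")
def dim_customer():
    return spark.read.format("delta").load("/mnt/raw/customers")

@dlt.table(comment="Pre-aggregated sales to back Power BI aggregation tables")
def agg_sales_by_customer():
    sales = spark.read.format("delta").load("/mnt/raw/sales")
    return (sales.groupBy("customer_id")
                 .agg(F.sum("amount").alias("total_amount")))
```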


Automating Parallelism Decisions in Flink Batch Jobs

Lijie Wang and Zhu Zhu describe Apache Flink’s batch scheduler:

Deciding proper parallelisms of operators is not an easy task for many users. For batch jobs, a small parallelism may result in long execution time and a big failover regression, while an unnecessarily large parallelism may result in resource waste and more overhead cost in task deployment and network shuffling.

To decide a proper parallelism, one needs to know how much data each operator needs to process. However, it can be hard to predict the data volume to be processed by a job because it can be different every day. And it can be harder or even impossible (due to complex operators or UDFs) to predict the data volume to be processed by each operator.

To solve this problem, we introduced the adaptive batch scheduler in Flink 1.15. The adaptive batch scheduler can automatically decide the parallelism of an operator according to the size of its consumed datasets.

Read on to see some of the benefits of using the adaptive batch scheduler, as well as some of the decision points it uses along the way.
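
If you want to experiment with it, here is a rough sketch of turning the adaptive batch scheduler on from PyFlink; in most deployments these keys live in flink-conf.yaml instead, and the option names reflect my reading of the Flink 1.15 configuration, with illustrative values:

```python
# A minimal sketch; the option names are assumptions based on Flink 1.15 and
# the values are illustrative only.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
conf = t_env.get_config().get_configuration()

conf.set_string("jobmanager.scheduler", "AdaptiveBatch")
# Leave operator parallelism unset (-1) so the scheduler decides it.
conf.set_string("parallelism.default", "-1")
# Bounds plus the target data volume each task should process.
conf.set_string("jobmanager.adaptive-batch-scheduler.min-parallelism", "1")
conf.set_string("jobmanager.adaptive-batch-scheduler.max-parallelism", "128")
conf.set_string("jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task", "8g")
```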
