Category: Spark

Summarizing Data & AI Summit Announcements

Zach Stagers hits the high notes:

One of the biggest cheers of the keynote was that Delta is being fully open sourced! Databricks continue to share their incredible work to help drive our industry forward. Delta already has wide adoption, but with the open sourced version now being levelled up to the same standard as the ‘proprietary’ one, this should help cement it as the default choice for lake-based storage.

There were some announcements of things to come with Delta too, such as optimised deletes and updates that remove single rows instead of having to completely rewrite the file. It’ll be really interesting to see how this works, and just how much it boosts performance.

Read on for more notes on several big announcements.

Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al., announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…

Data Governance in Databricks with Unity Catalog

Paul Roome, et al., announce the upcoming GA for Databricks Unity Catalog:

Today we are excited to announce that Unity Catalog, a unified governance solution for all data assets on the Lakehouse, will be generally available on AWS and Azure in the upcoming weeks. Currently, you can apply for a public preview or reach out to a member of your Databricks account team.

In a previous blog, we set out our vision for a governed lakehouse and how Unity Catalog can help customers simplify governance at scale. This blog will explore the most recent updates to Unity Catalog and our growing partner ecosystem.

Click through for those updates and to sign up for the public preview if so inclined.

Customer Segmentation via Databricks Solution Accelerator

Gavita Regunath discovers customer segments in a dataset:

We will be using the German Credit dataset, a publicly available dataset provided by Dr. Hans Hofmann of the University of Hamburg. The German Credit dataset contains features describing 1000 loan applicants who have taken credit from the bank. Using this dataset, our aim will be to answer the following question: “How should the bank personalise its products for its customers?”

Click through to see an example of clustering to generate customer segments.
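
To give a feel for what clustering does here, below is a minimal k-means sketch in plain Python. It is not the approach from the linked post, which the excerpt does not detail; the applicant rows are made-up illustrative values rather than German Credit data, and the `kmeans` helper and its parameters are hypothetical names chosen for this example.

```python
# Minimal k-means sketch in plain Python (no external libraries).
# The applicant rows below are made-up values, NOT German Credit data.
from math import dist


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Deterministic start: take k points spread evenly through the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


# Hypothetical (age in years, credit amount in thousands) pairs.
applicants = [(22, 1.5), (25, 2.0), (24, 1.8), (45, 9.0), (50, 10.5), (48, 9.5)]
centroids, labels = kmeans(applicants, k=2)
print(labels)
```

In practice you would scale the features first, since age and credit amount are on different ranges, and pick the number of clusters with something like an elbow plot rather than hard-coding it.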

Software Engineering Practices for Notebooks

Rafi Kurlansik and Austin Ford explain how to get the most out of notebooks, using Databricks as an example:

Notebooks are a popular way to start working with data quickly without configuring a complicated environment. Notebook authors can quickly go from interactive analysis to sharing a collaborative workflow, mixing explanatory text with code. Often, notebooks that begin as exploration evolve into production artifacts. For example,

1. A report that runs regularly based on newer data and evolving business logic.

2. An ETL pipeline that needs to run on a regular schedule, or continuously.

3. A machine learning model that must be re-trained when new data arrives.

Perhaps surprisingly, many Databricks customers find that with small adjustments, notebooks can be packaged into production assets, and integrated with best practices such as code review, testing, modularity, continuous integration, and versioned deployment.

Read on for several tips and recommendations.
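
As a small illustration of the modularity and testing points, here is a sketch of notebook logic pulled out into a pure function with a test next to it. The function name, the `tax_rate` parameter, and the row fields are hypothetical and do not come from the linked post.

```python
# Sketch: transformation logic extracted from a notebook cell into a
# pure function, so it can be imported, reviewed, and unit tested.
# All names and fields here are hypothetical.

def add_revenue(rows, tax_rate=0.2):
    """Return new rows with a net 'revenue' field; the input is not mutated."""
    return [
        {**row, "revenue": round(row["price"] * row["qty"] * (1 - tax_rate), 2)}
        for row in rows
    ]


def test_add_revenue():
    rows = [{"price": 10.0, "qty": 3}]
    result = add_revenue(rows)
    assert result[0]["revenue"] == 24.0
    # The original rows are left untouched.
    assert "revenue" not in rows[0]


test_add_revenue()
```

Once the logic lives in a function like this, the notebook itself can shrink to a thin layer that loads data, calls the function, and writes results, while the test runs in continuous integration.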

PHI De-Identification in Databricks with NLP

Amir Kermany, et al., share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.

To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.

Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow; for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks-supported Git providers including GitHub, GitLab, Bitbucket, Azure DevOps and AWS CodeCommit.

Read on to see how it works.
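
For a rough idea of what a Git-sourced task looks like, here is a sketch of a job definition built as a Python dictionary. The field names (`git_source`, `git_url`, `git_provider`, `git_branch`, and `source` on the notebook task) reflect my understanding of the Databricks Jobs API 2.1 and should be checked against the current documentation; the job name, repository URL, task key, and notebook path are hypothetical, and cluster settings are left out for brevity.

```python
import json

# Sketch of a job definition whose notebook task is sourced from Git.
# Field names are assumed from the Databricks Jobs API 2.1; verify them
# against the official docs. All values below are hypothetical.
job_spec = {
    "name": "nightly-report",
    "git_source": {
        "git_url": "https://github.com/example-org/example-repo",
        "git_provider": "gitHub",
        "git_branch": "main",  # a tag or commit could be pinned instead
    },
    "tasks": [
        {
            "task_key": "run_report",
            "notebook_task": {
                # Path is relative to the repository root, not the workspace.
                "notebook_path": "notebooks/report",
                "source": "GIT",
            },
        }
    ],
}

# Serialize to the JSON body that would be sent to the jobs endpoint.
payload = json.dumps(job_spec, indent=2)
print(payload)
```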

Improving Join Performance with Skewed Datasets in Spark

Ajay Gupta gets into the topic of join performance:

Performing Joins on Skewed Datasets: A Dataset is considered to be skewed for a Join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys. For example, when 80% of records in the datasets contribute to only 20% of Join keys.

Implications of Skewed Datasets for Join: Skewed Datasets, if not handled appropriately, can lead to stragglers in the Join stage (Read this linked story to know more about Stragglers). This brings down the overall execution efficiency of the Spark job. Also, skewed datasets can cause memory overruns on certain executors leading to the failure of the Spark job. Therefore, it is important to identify and address Join-based stages where large skewed datasets are involved.

Read on for five techniques which may help you out.
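
The excerpt does not say which five techniques the post covers, so as a separate illustration, here is a plain Python sketch of key salting, one commonly used way to spread a hot join key. The toy rows and `NUM_SALTS` are hypothetical, and the salt is derived from row position only to keep the example deterministic; in a real Spark job it would typically be random.

```python
from collections import Counter
from itertools import product

NUM_SALTS = 4  # hypothetical salt count; tune to the observed skew

# Toy "large" side: key "a" dominates (8 of 10 rows).
large = [("a", i) for i in range(8)] + [("b", 8), ("c", 9)]
# Toy "small" side: one row per key.
small = [("a", "x"), ("b", "y"), ("c", "z")]

# Salt the large side so the hot key spreads over NUM_SALTS sub-keys.
salted_large = [((key, i % NUM_SALTS), value) for i, (key, value) in enumerate(large)]

# Replicate the small side once per salt so every sub-key finds a match.
salted_small = {
    (key, salt): value for (key, value), salt in product(small, range(NUM_SALTS))
}

# Join on the salted key, then drop the salt from the output.
joined = [(key, value, salted_small[(key, salt)]) for (key, salt), value in salted_large]

# Size of the largest group per join key, before and after salting.
max_before = max(Counter(key for key, _ in large).values())
max_after = max(Counter(key for key, _ in salted_large).values())
print(max_before, max_after)
```

The join result is the same as an unsalted join, but the largest group per key shrinks, which is what relieves the straggler task. The cost is that the small side grows by a factor of `NUM_SALTS`.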

Delta Live Tables and Power BI Data Modeling

Tahir Fayyaz goes from Delta Lake to Power BI:

To get the optimal performance from Power BI it is recommended to use a star schema data model and to make use of user-defined aggregated tables. However, as you build out your facts, dimensions, and aggregation tables and views in Delta Lake, ready to be used by the Power BI data model, it can become complicated to manage all the pipelines, dependencies, and data quality as you need to consider the following:

– How to easily develop and manage the data model’s transformation code.

– How to run and scale data pipelines for the model as data volumes grow.

– How to keep all the Delta Lake tables updated as new data arrives.

– How to view the lineage for all tables as the model gets more complex.

– How to actively stop data quality issues that result in incorrect reports.

Read on for recommendations, a couple of architectural diagrams, and some sample code.
