Press "Enter" to skip to content

Category: Data

Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al, announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…

Comments closed

Unicode Character Generation in Power Query

Meagan Longoria needs more Unicode:

You may have used the UNICHAR() function in DAX to return Unicode characters in DAX measures. If you haven’t yet read Chris Webb’s blog post on the topic, I recommend you do. But did you know there is a Power Query function that can return Unicode characters? This can be useful in cases when you want to assign a Unicode character to a categorical value.

Click through to see how this works.

Comments closed

Data Governance in Databricks with Unity Catalog

Paul Roome, et al, announce the upcoming GA for Databricks Unity Catalog:

Today we are excited to announce that Unity Catalog, a unified governance solution for all data assets on the Lakehouse, will be generally available on AWS and Azure in the upcoming weeks. Currently, you can apply for a public preview or reach out to a member of your Databricks account team.

In a previous blog, we set out our vision for a governed lakehouse and how Unity Catalog can help customers simplify governance at scale. This blog will explore the most recent updates to Unity Catalog and our growing partner ecosystem.

Click through for those updates and to sign up for the public preview if so inclined.

Comments closed

PHI De-Identification in Databricks with NLP

Amir Kermany, et al, share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.

To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.

Comments closed

Example Data Pre-Processing Activities

Aayush Srivastava takes us through some pre-processing activities in machine learning:

After selecting the raw data for ML training, the most important task is data pre-processing. In broad sense, data preprocessing will convert the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it can be as per the expectation of machine learning algorithm

Read on for examples of pre-processing steps and how pre-processing differs from data cleaning.

Comments closed

Unity Catalog in Azure Databricks

Paul Roome, et al, announce Unity Catalog:

We are excited to announce that data lineage for Unity Catalog, the unified governance solution for all data and AI assets on lakehouse, is now available in preview.

This blog will discuss the importance of data lineage, some of the common use cases, our vision for better data transparency and data understanding with data lineage, and a sneak peek into some of the data provenance and governance features we’re building.

Click through to see what it currently supports. My curious question is around whether this and Microsoft Purview will play nice in an Azure Databricks setup.

Comments closed

Knowledge Graphs, Data Fabrics, and Data Meshes

Alan Morrison describes the history of three concepts:

By 2014, SAP was using “in-memory data fabric” to describe a virtual data warehouse, a key element of its HANA “360-degree customer view” product line. Gartner for its part uses the term “data fabric” to this day as an all-encompassing means of heterogeneous data integration. From a 2021 post on data fabric architecture: 

Read on for a high-level discussion of what each is and how it fits into the context of data warehouses and data lakes.

Comments closed

Searching for Strings in a Database

Aaron Bertrand updates an old tip:

Back in 2015, I wrote a tip called Search all string columns in all SQL Server databases. That tip focused on finding strings within string-based columns in all tables across all user databases. I was recently asked if this could be made more flexible; for example, can it search views as well, and can it search only in a specific database?

Click through for a new version which works in SQL Server 2016 and later.

Comments closed

File Format Throwdown

Tomaz Kastrun tries out several file formats in Azure Data Lake Storage (Gen2):

CSV data format is an old format and very common for data tasks, like import, export or storing. And when it comes performance of creating CSV file, reading and writing CSV files, how does it still stand against some other formats.

We will be looking at benchmarking the CRUD operations with different data formats; from CSV to ORC, Parquet, AVRO and others with the simple Azure data storage operations, like Create, Write, read and transform.

It’s important to remember that Parquet and ORC are intended to solve radically different problems than Avro. Parquet and ORC are columnar datasets intended to aggregate quickly and efficiently, whereas Avro is intended for efficient row storage. CSV is intended for easy-to-work-with row storage.

Then, Tomaz follows up with some R:

we have created Azure blob storage, connected secure connection using Python and started uploading files to blob store from SQL Server. Alongside, we compared the performance of different file types. ORC, AVRO, Parquet, CSV and Feather. Coming to conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with R language. The goal is to see, if CSV file format can be replaced by a file type that better, both in performance and storage.

The Feather file format, by the way, comes from Apache Arrow and works especially well with Python and R. You might not get the same performance benefits in other languages, depending on its library support.

Comments closed

“Production” in Data Analytics

Joey Jablonski brings up an important point:

Data-driven environments have a fundamentally different set of needs around testing, deployment, and visibility then traditional business applications. Data driven environments need access to fresh data on a high level of update frequency to ensure that data engineers and data scientists are able to effect outputs and recommendations on a timeline that has a positive impact on business decisions and customer experiences.

My day job involves running a predictive analytics team. We train models on production data—there’s very little value in training models on artificial dev data (outside of understanding the parameters of the modeling process), so even our development data generally comes from production. I don’t know that I’m sold on data mesh as a solution to this but it’s worth investigation.

Comments closed