Press "Enter" to skip to content

Category: Spark

Azure Databricks Initialization Scripts

Alex Crampton explains how initialization scripts work in Azure Databricks:

This blog will demonstrate the use of cluster-scoped initialisation scripts for Azure Databricks. An example will run through how to configure an initialisation script to install libraries that are not included in the Azure Databricks runtime environment onto a cluster. It will cover how to do this firstly using the Databricks UI, followed by how to include it in your CI/CD solutions.

Read on for some examples.
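To give a flavour of what this looks like in practice, here is a minimal sketch of my own (not Alex's exact example): write a small shell script to DBFS from a notebook, then point the cluster's "Init Scripts" configuration at that path. The script path and the library being installed are placeholders.

```python
# Minimal sketch (not the post's exact script), run from a Databricks notebook:
# stage a cluster-scoped init script in DBFS, then reference
# dbfs:/databricks/scripts/install-libs.sh in the cluster's "Init Scripts" settings.
dbutils.fs.mkdirs("dbfs:/databricks/scripts/")

dbutils.fs.put(
    "dbfs:/databricks/scripts/install-libs.sh",
    """#!/bin/bash
# Install an example library that is not part of the Databricks runtime
/databricks/python/bin/pip install great-expectations
""",
    True,  # overwrite if the script already exists
)
```

Every node in the cluster runs the script at startup, so the library is available before any notebook or job attaches.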


Databricks Extension for VSCode at 1.0

Gerhard Brueckl shares the good news:

As you probably know from my previous posts, my colleagues at paiqo.com and I are constantly working to improve our VSCode extension for Databricks. Almost every month we silently release a new version to the VSCode gallery so you get the latest features. However, as this is a special release, I am also writing a dedicated blog post for it.

There’s a lot of cool stuff in here, so check it out.


Feeding Synapse Spark Info to On-Prem Kafka Clusters

Bhadreshkumar Shiyal finds a solution:

Microsoft’s official documentation for Azure Data Factory contains a tutorial which explains how to access an On-Premises SQL Server from Azure Data Factory which is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint – Azure Data Fac….

Although based upon the article’s solution, to meet our requirements we needed to substitute on-prem Apache Kafka for on-prem SQL Server, and instead of an Azure Data Factory inside a Managed VNet, we used a Synapse workspace inside a Managed VNet. The “Forwarding Vnet” concept explained in the above tutorial remains as-is in our approach.

As soon as you turn on Data Exfiltration Protection (DEP), the lockdown is real. Click through to see what the process of exfiltrating data through an approved mechanism looks like.
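Once that networking is in place, the Spark side is just the standard Kafka sink. As a hedged sketch of my own (broker address, topic, and checkpoint path are placeholders, and the Kafka connector library is assumed to be available on the Synapse Spark pool):

```python
# Hedged sketch: with the managed VNet / private endpoint plumbing from the
# article in place, writing from a Synapse Spark pool to an on-prem Kafka
# cluster is the ordinary Kafka sink. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()

# Demo source; in practice this would be the stream you want to forward.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.select(to_json(struct("timestamp", "value")).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "onprem-kafka.internal.contoso.com:9092")
    .option("topic", "synapse-events")
    .option("checkpointLocation",
            "abfss://checkpoints@mydatalake.dfs.core.windows.net/kafka-sink")
    .start()
)
```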


“Warming Up” Databricks Clusters

Ust Oldfield needs that cluster to be up:

Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 mins through to almost 10 mins. For some workloads and users, this waiting time can be frustrating if not unacceptable.
For this use case, we had streaming clusters that needed to be available for when streams started at 07:00 and to be turned off when streams stopped being sent at 21:00. Similarly, there was also a need from business users for their SQL Warehouse clusters to be available for when business started trading so that their BI reports didn’t time out waiting for the clusters to start.

Read on to see one way to solve this problem without having a cluster run 24/7.
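One hedged sketch of the "warm up" idea (not necessarily Ust's exact approach): call the Databricks clusters REST API from something scheduled shortly before 07:00, such as an Azure Function or a pipeline. The workspace URL, token handling, and cluster ID below are placeholders.

```python
# Hedged sketch: start an interactive cluster ahead of the 07:00 window via the
# Databricks REST API. Workspace URL, token source, and cluster ID are placeholders.
import os
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]                                # supply securely
CLUSTER_ID = "0601-182128-abcd1234"                                   # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/start",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID},
    timeout=30,
)
resp.raise_for_status()
# A matching scheduled call to /api/2.0/clusters/delete can terminate it at 21:00.
print("Start requested; the cluster should be warm before streams begin.")
```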


Python UDFs in Databricks SQL

Martin Grund, et al, announce a new preview feature in Databricks:

To define the Python UDF, all you have to do is write a CREATE FUNCTION SQL statement. This statement defines a function name, input parameters and types, specifies the language as PYTHON, and provides the function body between $$.

The function body of a Python UDF in Databricks SQL is equivalent to a regular Python function, with the UDF itself returning the computation’s final value. Dependencies from the Python standard library and Databricks Runtime 10.4, such as the json package in the above example, can be imported and used in your code. You can also define nested functions inside your UDF to encapsulate code to build or reuse complex logic.

I think my biggest concern here would be performance, though I say that without having used the feature.
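For a sense of the syntax being described, here is a hedged sketch of such a statement. The function name and logic are mine, this assumes a workspace where the preview is enabled, and the statement could equally be pasted into the SQL editor rather than run through spark.sql.

```python
# Hedged sketch of the CREATE FUNCTION syntax described above; the function name
# and body are illustrative, not the announcement's example. Assumes Python UDFs
# in Databricks SQL are available in the workspace.
spark.sql("""
CREATE OR REPLACE FUNCTION redact_email(raw STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
import re
# Keep the domain, mask the local part: 'someone@example.com' -> '***@example.com'
return re.sub(r'^[^@]+', '***', raw) if raw else raw
$$
""")

spark.sql("SELECT redact_email('someone@example.com') AS masked").show()
```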


Pre-Processing Data Explorer Data with Spark

Hauke Mallow does some data engineering:

We often see customer scenarios where historical data has to be migrated to Azure Data Explorer (ADX). Although ADX has very powerful data-transformation capabilities via update policies, sometimes more or less complex data engineering tasks must be done upfront. This happens if the original data structure is too complex or single data elements are too big, hitting Data Explorer’s limits of 1 MB for dynamic columns or the 1 GB maximum ingest file size for uncompressed data (see also Comparing ingestion methods and tools).

Let’s think about an Industrial Internet-of-Things (IIoT) use-case where you get data from several production lines. In the production line several devices read humidity, pressure, etc. The following example shows a scenario where a one-to-many relationship is implemented within an array. With this you might get very large columns (with millions of device readings per production line) that might exceed the limit of 1 MB in Azure Data Explorer for dynamic columns. In this case you need to do some pre-processing.

Click through to see how you can do this with an Azure Synapse Analytics Spark pool prior to ingesting it with a Data Explorer pool.
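As a hedged sketch of the kind of pre-processing involved (column names and paths are placeholders), exploding the array turns one row per production line into one row per device reading, keeping any single dynamic column well under the 1 MB limit:

```python
# Hedged sketch: flatten the one-to-many array of device readings into one row
# per reading before ingestion into Data Explorer. Column names and ADLS paths
# are placeholders, not the article's exact schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.getOrCreate()

raw = spark.read.json("abfss://iiot@mydatalake.dfs.core.windows.net/raw/production-lines/")

flattened = (
    raw.select("productionLineId", "timestamp",
               explode(col("deviceReadings")).alias("reading"))
       .select(
           "productionLineId",
           "timestamp",
           col("reading.deviceId").alias("deviceId"),
           col("reading.humidity").alias("humidity"),
           col("reading.pressure").alias("pressure"),
       )
)

# Write back out (e.g. as Parquet) for ingestion, or hand the DataFrame to the
# Spark-to-ADX connector.
flattened.write.mode("overwrite").parquet(
    "abfss://iiot@mydatalake.dfs.core.windows.net/curated/readings/")
```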


Data Sharing and Secure Cleanrooms in Databricks

Craig Porteous reviews a couple of announcements from Data + AI Summit:

Having worked with many organisations across different industries and sectors, I find the sharing of data with partners and vendors is always a pain point, and one that all too often results in both parties not quite getting what they want or need. This isn’t restricted to my experience, however, which is why Databricks announced Delta Sharing back at Data + AI Summit 2021.

Coming to this year’s conference, Delta Sharing has been established as the foundation for many new features, with the announcements of Databricks Marketplace and Cleanrooms, for example, both built upon the Delta Sharing protocol. We’ll explore Cleanrooms below and I’ll look at the Databricks Marketplace in its own post.

Read on for Craig’s thoughts on two of the bigger announcements at this year’s summit.
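For context on the protocol underneath these announcements, here is a hedged sketch of what consuming a Delta Share looks like with the open delta-sharing Python connector; the profile file and share/schema/table coordinates are placeholders supplied by the data provider.

```python
# Hedged sketch of the consumer side of Delta Sharing (the protocol Marketplace
# and Cleanrooms build on). The profile file and share/schema/table names are
# placeholders that a provider would supply.
import delta_sharing

profile = "/dbfs/FileStore/shares/open-datasets.share"    # credential file from the provider
table_url = f"{profile}#retail_share.sales.daily_orders"  # <profile>#<share>.<schema>.<table>

# Small tables can be pulled straight into pandas...
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# ...or loaded as a Spark DataFrame when running on a cluster with the
# connector installed:
# spark_df = delta_sharing.load_as_spark(table_url)
```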


Mounting Data Lake Storage from a Spark Pool

Kamil Nowinski runs into some trouble:

Last weekend, I played a bit with Azure Synapse, mounting Azure Data Lake Storage (ADLS) Gen2 in a Synapse notebook via the API in the Microsoft Spark Utilities (MSSparkUtils) package. I wanted to just do a simple test, hence I followed the documentation from Microsoft: How to use file mount/unmount API in Synapse.
Having an ADLS Account already created in a subscription – should be easy peasy, right?

Read on to understand when things might be a little more complicated than they seem. And more frustrating, once you see the cause of the problem.
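For reference, here is a hedged sketch of the mount API from that documentation; the storage account, container, and linked service names are placeholders.

```python
# Hedged sketch of the MSSparkUtils mount API, run inside a Synapse notebook.
# Storage account, container, mount point, and linked service are placeholders.
from notebookutils import mssparkutils

mssparkutils.fs.mount(
    "abfss://data@mydatalakegen2.dfs.core.windows.net",
    "/adls_mount",
    {"linkedService": "MyADLSLinkedService"},  # or {"accountKey": "..."} / {"sasToken": "..."}
)

# Mounted paths are addressed through the job-scoped synfs scheme:
job_id = mssparkutils.env.getJobId()
df = spark.read.parquet(f"synfs:/{job_id}/adls_mount/some_folder")
```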


Summarizing Data + AI Summit Announcements

Zach Stagers hits the high notes:

One of the biggest cheers of the keynote was that Delta is being fully open sourced! Databricks continue to share their incredible work to help drive our industry forward. Delta already has wide adoption, but with the open sourced version now being levelled up to the same standard as the ‘proprietary’ one, this should help cement it as the default choice for lake-based storage.

There were some announcements of things to come with Delta too, such as optimised deletes and updates that remove single rows instead of having to completely rewrite the file. It’ll be really interesting to see how this works, and just how much it boosts performance.

Read on for more notes on several big announcements.


Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al, announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes them to potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…
