Spark – Page 3 – Curated SQL

Microsoft Fabric Shortcuts and Lakehouse Maintenance

Published 2025-03-07 by Kevin Feasel

Dennes Torres has a public service announcement:

I have written about lakehouse maintenance before, including maintenance across multiple lakehouses, published videos on the subject, and provided sample code for it.

However, there is one problem: maintenance should never be executed over shortcuts.

The tables require maintenance in their original place. As our solutions advance, we start using shortcuts, lots of them. Our maintenance code should always skip shortcuts and perform maintenance only on the actual tables.

Click through to see how you can differentiate shortcuts from actual tables and write code to avoid shortcuts.

Spark Connector for Fabric Data Warehouse

Published 2025-02-28 by Kevin Feasel

Arshad Ali announces a connector:

We are pleased to announce the availability of the Fabric Spark connector for Fabric Data Warehouse (DW) in the Fabric Spark runtime. This connector enables Spark developers and data scientists to access and work with data from Fabric DW and the SQL analytics endpoint of the lakehouse, either within the same workspace or across different workspaces, using a simplified Spark API. The connector will be included as a default library within the Fabric Runtime, eliminating the need for separate installation.

Click through to check out its capabilities. This is a tiny step toward where I think Microsoft Fabric should go: any tool accessing the same data, eliminating separate lakehouses vs warehouses and having to remember that you can’t use this syntax in this scenario unless you connect to it this way and sacrifice one live chicken.

Table Compaction in Apache Spark

Published 2025-02-27 by Kevin Feasel

Miles Cole groups things together:

If there’s anything that data engineers agree about, it’s that table compaction is important. One of the first big lessons that folks learn early on is that not compacting tables can cause serious performance issues: you’ve gotten your lakehouse pilot approved, it’s been running for a couple of months in production, and you find that both reads and writes are getting slower and slower while your data volumes have not increased drastically. Guess what, you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction.

Read on for a dive into the power of compaction (converting a large number of small files into a small number of large files) and plenty of tips along the way.
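
To make the "many small files into a few large files" idea concrete, here is a minimal sketch in plain Python. It only plans which files would be grouped together, using a first-fit-decreasing heuristic. The function name `plan_compaction` and the 128 MB target are assumptions for illustration; this is not how Spark or Delta Lake implement compaction internally, and it is not taken from the linked article.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Group file sizes into bins of at most target_mb (first-fit decreasing).

    Hypothetical helper: each returned bin represents one larger output file.
    """
    bins: List[List[float]] = []
    totals: List[float] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                # The file fits in an existing bin, so add it there.
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
            totals.append(size)
    return bins


# 64 small files of 4 MB each: a typical "small file problem" in miniature.
small_files = [4.0] * 64
plan = plan_compaction(small_files, target_mb=128.0)
print(f"{len(small_files)} input files -> {len(plan)} output files")
```

A reader opening the table afterwards would list and open far fewer files for the same amount of data, which is where the read-side benefit comes from.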

Object Ownership in Databricks

Published 2025-02-03 by Kevin Feasel

Chen Hirsh shares a tale of woe:

Have you ever made a change in your system and immediately regretted it? A few weeks ago, I did just that while working with a customer on their Databricks platform. His IT guys made some changes, moving a user to another domain. In Databricks, this is considered a new user, so I added the new user and gave him the same permissions as the old user.

And then, without thinking twice, I deleted the old user from Databricks.

Things did not go well from there. Read on to learn what happened, why, and how to avoid this problem in the future.

Working with Unity Catalog

Published 2025-01-31 by Kevin Feasel

Dustin Vannoy has a new video:

Unity Catalog Open Source Software (OSS) is a compelling project and there are some key benefits to working with it locally. In this video I share reasons for using the open source project Unity Catalog (UC) and walk through some of the setup and testing I did to create and write to tables from Apache Spark.

Click through for the video, as well as a text summary and script examples.

Emitting Data to a Single CSV in Spark

Published 2025-01-28 by Kevin Feasel

Chen Hirsh wants to consolidate:

To write and read data faster, Spark splits the work between nodes in a cluster, each reading/writing part of the data. That’s why, in the screenshot above, there are 3 CSV files (those are the files starting with “Part”, with a CSV extension), instead of 1. Note that this can also occur when working with a single node cluster since Spark splits the work into tasks.

This behavior is great if you intend to keep working with the CSV files in Databricks since reading will be faster. But if you want to share this file with someone outside of Databricks, this may be inconvenient.

Read on for two ways of doing this, as well as the price you pay to get it done.
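
As an illustration of the consolidation problem, here is a minimal sketch that merges part files after the fact using only the Python standard library. It is not necessarily either of the two approaches in the linked article. The helper name `merge_part_files`, the `part-*.csv` naming pattern, and the assumption that every part file carries its own header row are all assumptions for this example.

```python
import csv
import glob
import os
import tempfile


def merge_part_files(folder: str, output_path: str) -> int:
    """Merge every part-*.csv in `folder` into one CSV with a single header.

    Assumes each part file was written with a header row.
    Returns the number of data rows written.
    """
    rows_written = 0
    header_written = False
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(os.path.join(folder, "part-*.csv"))):
            with open(path, newline="") as part:
                reader = csv.reader(part)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty part files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demo: simulate three part files like the ones left in a Spark output folder.
with tempfile.TemporaryDirectory() as folder:
    for i in range(3):
        with open(os.path.join(folder, f"part-0000{i}.csv"), "w", newline="") as f:
            csv.writer(f).writerows([["id", "name"], [i, f"row{i}"]])
    combined = os.path.join(folder, "combined.csv")
    row_count = merge_part_files(folder, combined)
    with open(combined, newline="") as f:
        merged_rows = list(csv.reader(f))

print(row_count, merged_rows[0])
```

The trade-off is the same whichever way you consolidate: the final write runs through a single process, so you give up the parallelism that produced the separate part files in the first place.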

Data Masking in Azure Databricks

Published 2025-01-14 by Kevin Feasel

Rayis Imayev hides some information:

One way to protect sensitive information from end users in a database is through dynamic masking. In this process, the actual data is not altered; however, when the data is exposed or queried, the results are returned with modified values, or the actual values are replaced with special characters or notes indicating that the requested data is hidden for protection purposes.

In this blog, we will discuss a different approach to protecting data, where personally identifiable information (PII – a term you will frequently encounter when reading about data protection and data governance) is actually changed or updated in the database / persistent storage. This ensures that even if someone gains access to the data, nothing will be compromised. This is usually needed when refreshing a lower environment from a production database or dataset that contains PII data elements. Your QA team will appreciate having a realistic data volume that resembles the production environment but with masked data.

Rayis goes into depth on the process. I would also recommend checking out the article on row filters and column masks for more information.
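
To show what "actually changing" PII before it reaches a lower environment might look like, here is a minimal sketch in plain Python. It uses a salted SHA-256 hash as the masking function, which is one possible technique and not necessarily the one Rayis uses. The names `mask_value` and `mask_record`, the salt, and the sample rows are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def mask_value(value: str, salt: str) -> str:
    """Replace a value with a deterministic, salted hash-based token."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"masked_{digest[:12]}"


def mask_record(record: Dict[str, object], pii_columns: Iterable[str], salt: str) -> Dict[str, object]:
    """Return a copy of the record with the listed PII columns masked."""
    masked = dict(record)  # copy so the source row is left untouched
    for column in pii_columns:
        if masked.get(column) is not None:
            masked[column] = mask_value(str(masked[column]), salt)
    return masked


# Hypothetical production rows being prepared for a QA environment.
production_rows = [
    {"customer_id": 1, "email": "ana@example.com", "city": "Lisbon"},
    {"customer_id": 2, "email": "ben@example.com", "city": "Leeds"},
]
qa_rows = [mask_record(row, ["email"], salt="demo-salt") for row in production_rows]
print(qa_rows[0])
```

Because the same input always produces the same token, joins on a masked column still line up across tables. A salted hash of a low-entropy value can still be guessed by brute force, so treat this as a sketch of the idea and not a vetted anonymization scheme.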

Custom SCD2 with PySpark

Published 2025-01-14 by Kevin Feasel

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.

Read on for an implementation in Python.
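
For readers who want the core SCD2 logic before clicking through, here is a minimal sketch in plain Python, without Spark. It is not Abhishek's implementation. The row layout (`key`, `attrs`, `valid_from`, `valid_to`, `is_current`) and the function name `apply_scd2` are assumptions chosen for illustration.

```python
from datetime import date
from typing import Dict, List


def apply_scd2(dimension: List[dict], incoming: Dict[int, dict], effective: date) -> List[dict]:
    """Apply incoming current-state records to an SCD2 dimension.

    Changed keys get their current row closed and a new current row added;
    unseen keys get a first current row; unchanged rows pass through.
    """
    result: List[dict] = []
    keys_with_current_row = set()
    for row in dimension:
        if row["is_current"]:
            keys_with_current_row.add(row["key"])
            new_attrs = incoming.get(row["key"])
            if new_attrs is not None and new_attrs != row["attrs"]:
                # Close the old version and open a new one from the effective date.
                result.append({**row, "valid_to": effective, "is_current": False})
                result.append({"key": row["key"], "attrs": new_attrs,
                               "valid_from": effective, "valid_to": None,
                               "is_current": True})
                continue
        result.append(row)
    for key, attrs in incoming.items():
        if key not in keys_with_current_row:
            # Brand-new dimension member.
            result.append({"key": key, "attrs": attrs, "valid_from": effective,
                           "valid_to": None, "is_current": True})
    return result


# Customer 1 changes address; customer 2 is new.
dimension = [{"key": 1, "attrs": {"address": "Oslo"}, "valid_from": date(2024, 1, 1),
              "valid_to": None, "is_current": True}]
incoming = {1: {"address": "Bergen"}, 2: {"address": "Lima"}}
updated = apply_scd2(dimension, incoming, effective=date(2025, 1, 1))
for row in updated:
    print(row)
```

A fact dated in 2024 would join to the closed "Oslo" row through its validity range, while a fact dated after the effective date would join to "Bergen", which is the history-preserving behavior the quoted passage describes.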

Creating and Working with an Azure Databricks SQL Warehouse

Published 2025-01-10 by Kevin Feasel

John Miner works a shift in the warehouse:

Many companies are leveraging data lakes to manage both structured and unstructured data. However, not all users are familiar with Python and the PySpark module. How can users with a solid understanding of ANSI SQL be effective in the Databricks environment?

Read on for the answer.

Session, DataFrameWriter, and Table Configurations in Spark

Published 2024-12-24 by Kevin Feasel

Miles Cole makes a configuration change:

With Spark and Delta Lake, just like with Hudi and Iceberg, there are several ways to enable or disable settings that impact how tables are created. These settings may affect data layout or table format features, but it can be confusing to understand why different methods exist, when each should be used, and how property inheritance works.

While platform defaults should account for most use cases, Spark provides flexibility to optimize various workloads, whether adjusting for read or write performance, or for hot or cold path data processing. Inevitably, the need to adjust configurations from the default will arise. So, how do we do this effectively?

Read on to learn how.


Category: Spark