Press "Enter" to skip to content

Category: Warehousing

Foreign Key Relationships in Microsoft Fabric Data Warehouses

Jared Westover looks at key constraints:

In late 2024, I noticed a comment on the Microsoft Learn site stating that foreign keys could improve query performance on tables in a Fabric warehouse. That claim immediately caught my attention. I wanted to answer a simple question: Do relationships help, hurt, or have no effect when added to tables in a Fabric warehouse?

Let’s get more specific—do foreign keys improve query performance when reading data (not loading)? In other words, do they make queries run faster?

Sadly, the answer is not as promising as with SQL Server. But this also makes sense considering the distributed nature of Fabric data warehouses.

Leave a Comment

Asynchronous SQL Statement Execution in Snowflake

Koen Verbeeck doesn’t want to wait for an answer:

It’s been a while since I blogged about Snowflake, but a recent LinkedIn post caught my attention: the ability to add asynchronous execution of SQL statements in a stored procedure. In other words: parallel execution of SQL statements. This got me excited, because in my opinion this is something that has been missing in T-SQL since forever. Every time you want to do something in parallel, you need to use external tools to accomplish this in SQL Server (or Azure SQL DB, or Fabric Warehouse, or Fabric SQL DB, or … you get the point). You needed to use SQL Server Agent Jobs, or SSIS packages, or Azure Data Factory and so on.

Snowflake introduces the ASYNC and AWAIT keywords, which can be used to trigger asynchronous execution. 

Read on for a very simple example and some thoughts from Koen. Aside from possibly making data modifications faster (assuming there are no constraint checks), I’m not quite sure what the major benefit to this is. I’d generally use asynchronous calls to support UI operations, letting a calling application respond to user input while some background thread processes data. But I’m not positive what you get from pushing async/await logic into the database itself.

Leave a Comment

Spatial Queries in Fabric Data Warehouse

Jovan Popvic reads a map:

Spatial data has become increasingly important in various fields, from urban planning and environmental monitoring to transportation and logistics. Fabric Data Warehouse offers spatial functionalities that enable you to query and analyze spatial data efficiently.

In this blog post, we will delve into the spatial capabilities in the Fabric Data Warehouse and demonstrate how to use the spatial functions in your queries.

This looks a bit like the way we perform spatial operations in SQL Server. Jovan shows off some examples of functionality, so check that out.

Leave a Comment

Notes on Change Tracking for Warehouse Incremental Loads

Meagan Longoria shares some hard-earned experience:

I have a few clients that incrementally load tables from a SQL Server source into their data warehouse or lakehouse by using change tracking. Lately, they encountered some issues with changes to the configuration and the data in the source database, so I decided to share some things you can check before using change tracking as part of your ETL load or when troubleshooting your data load.

Click through for three common issues you may run into while using change tracking.

Comments closed

Creating a Microsoft Fabric Warehouse with Service Principal

Gilbert Quevauvilliers sets up a new warehouse:

In this blog post I am going to show you how to create a Microsoft Fabric Warehouse, where the owner will be the Service Principal.

As mentioned in the blog post here are some of the advantages of having the Service Principal as the Warehouse Owner.

  • Using a Service Principal to create the warehouse avoids issue where the person who created the warehouse leaves the organization and issues arise when the users account is deleted from Entra ID.
  • You avoid the painful logging in with the user account to ensure the password remains updated.
  • The organization now owns the warehouse and not an individual user.

I will show you how I created a Warehouse with the owner being a Service Principal this using a Microsoft Fabric Notebook

Click through for the notebook and additional commentary.

Comments closed

Custom SCD2 with PySpark

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.

Read on for an implementation in Python.

Comments closed

Thoughts on Primary and Foreign Key Constraints

Rob Farley lays out an argument:

I am NOT suggesting that data integrity is irrelevant. Not at all. But how often do we need an enforced primary key or foreign key?

Be warned – I’m not actually going to come to a conclusion in this post. I’m going to make cases for both sides, and let you choose where you stand. The T-SQL Tuesday topic this month is Integrity, so some other people might have written on a similar topic, and produce even more compelling arguments one way or another. I’m the host this time, so you’ll be able to find the round-up (once it’s there) here on the LobsterPot Solutions site too.

I will come to a conclusion and it is that OLTP systems need primary and foreign key constraints to work properly. In the post, Rob asks a question around the last time I saw a key violation error in my application. The good(?) news is that I have plenty of them in the last application I built on SQL Server, because I need to rely on a source system that dumps data and doesn’t actually check to see if existing records were there already. That means I can’t simply perform an inner join from my table to the source table, because I could get multiple records back. No, instead, I need to use a common table expression or APPLY operator, retrieve the max values from the flotsam and jetsam that exists, and make my code harder to follow and perform worse as a result.

Distributed warehousing systems don’t have enforceable keys because of the technical challenge of enforcing keys without having different nodes talk to each other. But these things also assume either that you’ve pre-validated all of the data (like in a Kimball model), that you don’t care about duplicate records or messiness, or that you’ll fix the problem again somewhere downstream. Which, in the case of Microsoft Fabric, is typically necessary by the time you put the data into a semantic model, as those things really don’t like duplicate records and this tends to mess up relationships between tables.

Comments closed

Cloning a Table in Snowflake

Kevin Wilkie creates a clone:

Recently, I was asked to compare data from a table in Snowflake with the same table’s data from a few hours before. As always, no one had thought about copying the data into another table before they started working with it. They just remembered an approximate time they started working with the table — 10 AM — and the table’s name — Public.WorkingTable. So, let’s see what we can do, shall we?

Read on for the process, as well as circumstances in which cloning might fail.

Comments closed