SQL Data Warehouse Distribution Keys

Simon Whiteley explains the different distribution key options available in Azure SQL Data Warehouse and SQL Server APS:

Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn’t always a problem!

If I want to perform a count of records, grouped by a particular field, I can perform this on a round-robin table. Each distribution will run the query in parallel and return its grouped results. The results can simply be added together as a second part of the query, and adding together 60 smaller datasets shouldn’t be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!
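To make the two-step aggregation concrete, here is a rough Python simulation. It is not taken from Simon's post: the function names, the sample rows and the "colour" field are all hypothetical, and the real engine runs step 1 in parallel rather than in a loop.

```python
from collections import Counter

# Azure SQL Data Warehouse spreads each table across 60 distributions.
NUM_DISTRIBUTIONS = 60


def round_robin(rows, n=NUM_DISTRIBUTIONS):
    """Place each row on the next distribution in turn (hypothetical helper)."""
    distributions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        distributions[i % n].append(row)
    return distributions


def grouped_count(distributions, field):
    """Count rows per value of `field` in two steps."""
    # Step 1: every distribution counts only its own rows.
    partials = [Counter(row[field] for row in dist) for dist in distributions]
    # Step 2: add the small partial results together.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up sample data: 600 rows with three possible colours.
rows = [{"id": i, "colour": ["red", "green", "blue"][i % 3]} for i in range(600)]
distributions = round_robin(rows)
result = grouped_count(distributions, "colour")
```

Because a count only needs each row to be seen once, it does not matter which distribution a row landed on, which is why round-robin is good enough here.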

However, the issues arise when we have multiple tables in our query. Let’s take a very simple join between a fact table and a dimension. I’ve shown 6 distributions for simplicity, but this would be happening across all 60.

Choosing the right distribution key can make a huge difference in performance.
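As a sketch of why the key matters for joins, the Python simulation below compares how many fact rows can find their dimension row on the same distribution under each scheme. Everything in it is an assumption made for illustration: the table contents, the customer_id column, the helper names, and the use of a simple modulo as a stand-in for the warehouse's real hash function.

```python
NUM_DISTRIBUTIONS = 6  # 6 rather than 60, for simplicity


def round_robin(rows, n=NUM_DISTRIBUTIONS):
    """Place each row on the next distribution in turn."""
    distributions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        distributions[i % n].append(row)
    return distributions


def hash_distribute(rows, key, n=NUM_DISTRIBUTIONS):
    """Place each row according to its key value (modulo stands in for a hash)."""
    distributions = [[] for _ in range(n)]
    for row in rows:
        distributions[row[key] % n].append(row)
    return distributions


def local_matches(fact_dists, dim_dists, key):
    """Count fact rows whose dimension row sits on the same distribution."""
    matches = 0
    for facts, dims in zip(fact_dists, dim_dists):
        dim_keys = {row[key] for row in dims}
        matches += sum(1 for row in facts if row[key] in dim_keys)
    return matches


# Made-up data: 12 customers, 120 sales, every sale has a valid customer.
dimension = [{"customer_id": c} for c in range(12)]
facts = [{"sale_id": s, "customer_id": (s // 3) % 12} for s in range(120)]

rr_matches = local_matches(round_robin(facts), round_robin(dimension), "customer_id")
hash_matches = local_matches(
    hash_distribute(facts, "customer_id"),
    hash_distribute(dimension, "customer_id"),
    "customer_id",
)
```

With both tables distributed on customer_id, every sale finds its customer locally. With round-robin, some sales do not, so rows would have to be moved between distributions before the join could complete.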

Related Posts

Checking Azure Status

Arun Sirpal shows where to look if you think you’re experiencing an Azure SQL Database outage: It shows the many different layers involved with a product like Azure SQL Database. What happens if there is a loss of service for a specific component? Obviously we as customers would not be able to fix the issue […]

Azure Data Lake Store File Management With httr

Leila Etaati shows how to generate RESTful statements in R using httr: In this post, I am going to share my experiment in how to do file management in ADLS using R studio, to do this you need to have below items 1. An Azure subscription 2. Create an Azure Data Lake Store Account 3. […]
