Gauri Mahajan takes us through three distribution techniques when working with Azure Synapse Analytics dedicated SQL pool tables:
Data warehouses host much larger volumes of data than transactional databases. Reads far outnumber writes, and queries tend to return much larger result sets than the scalar values or paginated record sets typically retrieved from transactional databases. Because of this, there is greater pressure on the server to perform quickly. Modern data warehouses like Amazon Redshift, Azure Synapse, Snowflake and others employ approaches like data sharding, where data is distributed horizontally across multiple nodes that process it in parallel. This approach is highly scalable, as nodes can easily be added to a cluster as storage and performance needs increase. One key difference for tables hosted on such data warehouses is that they are distributed horizontally using different distribution algorithms, so that all the nodes in an Azure Synapse cluster have an equal share of responsibility for hosting, processing, and delivering data for any given query, which maximizes performance.
In this article, we will learn about the table distribution styles supported in Azure Synapse and how to use them for creating distributed tables.
Read on to learn more. This is an example of something we don’t think about on the SQL Server side, so when moving to Azure Synapse Analytics dedicated SQL pools, it can be easy to get this wrong and end up with sub-optimal performance.
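To make the risk concrete, here is a minimal Python sketch of how the three distribution styles (round-robin, hash, and replicated) place rows. It is a toy model, not Synapse's implementation: `crc32` stands in for the engine's internal hash function, the function names, the sample rows, and the `num_nodes` parameter are all hypothetical, and the only Synapse-specific fact assumed is that a dedicated SQL pool spreads each table across 60 distributions.

```python
from collections import Counter
from zlib import crc32

# Dedicated SQL pools split each table across 60 distributions.
NUM_DISTRIBUTIONS = 60


def round_robin(rows, n=NUM_DISTRIBUTIONS):
    """Assign rows to distributions in turn, ignoring row content."""
    return Counter(i % n for i, _ in enumerate(rows))


def hash_distribute(rows, key, n=NUM_DISTRIBUTIONS):
    """Assign each row by a deterministic hash of its distribution column.

    crc32 is only a stand-in here; the real engine uses its own hash.
    """
    return Counter(crc32(str(row[key]).encode()) % n for row in rows)


def replicate(rows, num_nodes):
    """Model a replicated table: every compute node holds a full copy."""
    return Counter({node: len(rows) for node in range(num_nodes)})


# Hypothetical sample data: a unique id and a heavily skewed country column.
rows = [
    {"customer_id": i, "country": "NZ" if i % 10 == 0 else "US"}
    for i in range(6000)
]

even = round_robin(rows)                   # same row count everywhere
by_id = hash_distribute(rows, "customer_id")  # many distinct keys spread out
by_country = hash_distribute(rows, "country")  # two distinct keys pile up

print("round-robin largest distribution:", max(even.values()))
print("hash(customer_id) distributions used:", len(by_id))
print("hash(country) distributions used:", len(by_country))
print("hash(country) largest distribution:", max(by_country.values()))
```

Hashing on a column with only two distinct values leaves nearly all of the 60 distributions empty, which is the kind of skew that a poorly chosen distribution column can cause. A high-cardinality column such as the hypothetical `customer_id` spreads the work much more evenly.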