Press "Enter" to skip to content

Category: Warehousing

Bucketing Data in Hive

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node

In simple words, You can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

Click through to see how bucketing works and examples of how you can use it to make queries faster.

Leave a Comment

Static versus Dynamic Partitioning in Hive

The Hadoop in Real World team explains the difference between two partitioning strategies:

The difference between static and dynamic partitioning only exists when the partition is being created based on how the partitions are added to the table. Once the partitions are created, the tables won’t have any difference like static and dynamic partitions. All partitions are treated and one and the same.

Click through for the difference.

Leave a Comment

Dedicated SQL Pool Index, Distribution, and Partition Guidance

I have a write-up on the specific value of distributions, indexes, and partitions in Azure Synapse Analytics dedicated SQL pools:

Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.

Read on for the guidance.

Comments closed

Synapse vs Snowflake

Travis Manning has a throw-down:

Data warehousing has become a hot topic for most organizations as data volume grows exponentially, and yet the capacity to manually manage it all but diminishes. The ecosystem is replete with options, each with a host of features and integrations. In this article, we will discuss two of the most common (and commonly discussed!) data warehousing services, Azure Synapse and Snowflake Data Warehouse (DW). For this article, we will try to focus on use cases, and which option is appropriate in that context.

Click through for the product comparison. One big difference not covered is pricing uncertainty. If you have a good understanding of the number of executions and computational complexity of your queries, as well as data quantities, Snowflake can be very competitively priced. But what can happen is that the competitive price turns into a much-less-competitive price by the time you’re fully up to speed.

1 Comment

Distribution Techniques in Azure Synapse Analytics

Gauri Mahajan takes us through three distribution techniques when working with Azure Synapse Analytics dedicated SQL pool tables:

Data warehouses host much larger volumes of data compared to transactional databases, the volume of reads is much more compared to writes and queries tend to result in much larger result sets compared to queries that retrieve scalar values or paginated record sets from transactional databases. Due to this nature of data warehouses, there is a higher impetus on the server to perform faster. Modern data warehouses like AWS Redshift, Azure Synapse, Snowflake and others employ approaches like data sharding where data is distributed horizontally on multiple nodes which process data in parallel. This approach is highly scalable as nodes can be easily added to a data cluster as the storage and performance need increases. One key aspect that is different for tables hosted on such data warehouses is that tables are distributed horizontally using different distribution algorithms, so that all the nodes in an Azure Synapse cluster have an equal share of responsibility for hosting, processing, and delivering data for any given query to maximize performance.

In this article, we will learn about the table distribution styles supported in an Azure Synapse and how to use them for creating distributed tables.

Read on to learn more. This is an example of something we don’t think about on the SQL Server side, so when moving to Azure Synapse Analytics dedicated SQL pools, it can be easy to get this wrong and end up with sub-optimal performance.

Comments closed

Type 1 SCDs in Delta Lake

Chris Williams starts a series on slowly changing dimensions in a Delta Lake:

Anyone that has contributed towards a Data Warehouse or a dimensional model in Power BI will know the distinction made between the time-series metrics of a Fact Table and the categorised attributes of a Dimension Table. These dimensions are also affected by the passage of time and require revised descriptions periodically which is why they are known as Slowly Changing Dimensions (SCD). See The Data Warehouse Toolkit – Kimball & Ross for more information.

Here is where the Delta Lake comes in. Using its many features such as support for ACID transactions (Atomicity, Consistency, Isolation and Durability) and schema enforcement we can create the same durable SCD’s. This may have required a series of complicated SQL statements in the past to achieve this. I will now discuss a few of the most common SCD’s and show how they can be easily achieved using a few Databricks Notebooks, which are available from my GitHub repo so you can download and have a go:

Check out the repo, but be sure to read the whole post.

Comments closed

Role-Based Access Control in Snowflake

Warner Chaves explains how role-based access controls work in Snowflake:

The data access privilege granularity is the lowest level of securable that you will use to provide data access. This can theoretically go all the way down to rows and all the way up to full databases. 

I usually recommend that people start out with using Schema as their data access securable granularity. Database is usually too broad and you will inevitably have to re-do your roles and table level. Below is too specific to turn it into a general methodology—you would end up with way too many roles. See the FAQ later in this post on how to mix and match granularities if needed.

Once you have the granularity defined, you then create back-end roles at that level.

Read on to see what those roles look like. It’s a pretty standard RBAC setup.

Comments closed

Attributing Redshift Costs to Users

Jason Pedreza, et al, show how you can break down query utilization by user in an Amazon Redshift database:

At its simplest form, cost attribution can be determined using the amount of the storage assigned to the individual objects using the ownership of the objects to the groups. But the downside of this approach is it doesn’t provide a true translation of the resource usage. For example, let’s say Team 1 has total object size of 1 TB, whereas Team 2 has 100 GB in total size. Team 1 member runs 10 queries daily, and Team 2 runs 1,000 queries per day. Of course, Team 2 uses more resources than Team 1.

The Amazon Redshift RA3 architecture allows you to pay for the compute and data warehouse storage capacity separately, therefore storage doesn’t reflect the resources used by the teams for the cost attribution.

Click through to see how.

Comments closed

Row Pattern Recognition in Snowflake

Koen Verbeeck knows how to make my blood boil:

I’m doing a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think it’s interesting enough to share. It might be some SQL function that I’d really like to be in SQL Server, it might be something else.

In the book T-SQL Window Functions – For data analysis and beyond, Itzik Ben-Gan explains the concept of row-pattern recognition (RPR) in a dedicated chapter (you can find a full book review here). It’s a concept that doesn’t exist in T-SQL, but is described in the SQL standard and is available in some other database systems. Snowflake has recently introduced support for RPR. 

Jokes about being angry aside, I’d really like to see row pattern recognition in SQL Server. It’s definitely not trivial to learn, but once you do, there’s a lot of power available to you. Koen also links to the Feedback item about this, so vote on that as well.

Comments closed

Diving into Data Mesh

Ust Oldfield has a series on data mesh. First up is a primer on the topic:

What is a Data Mesh? Data Mesh is new and was introduced in a few articles by Zhamak Dehghani, starting in May 2019. The first article, How To Move Beyond a Monolithic Data Lake to a Distributed and Data Mesh, and the second article, Data Mesh Principles and Logical Architecture, form the foundational thought pieces on data meshes. Very simply, and crudely, a Data Mesh aims to overcome the deficiencies of previous generations of analytical architectures by decentralising the ownership and production of analytical data to the teams who own the data domain. It is a convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.

Essentially applying the learnings from operational systems, of applying domain driven design, so that ownership of data is domain oriented. For example, a CRM team will own the Customer domain and all the data within it. Instead of data from all domains flowing into a central data repository, the individual domains curate and serve their datasets in an easily consumable way. A much closer alignment between operational and analytical data.

Then Ust has a deep dive:

A foundational principle of data mesh is the decentralisation of ownership of data to those closest to it. Those closest to it are those in the business, often using the operational systems and also using analytical data. Responsibility and ownership of the data is devolved from a central function to business units and domains. Therefore any change to how a business domain organises itself is limited to the domain itself rather than impacting the entire organisation. This is referred to as the bounded domain context.

Teams responsible for the business operation, e.g. CRM team responsible for Customers, are also responsible for the ownership and serving of the analytical data relating to their domain.

I probably need to spend more time thinking about it, but this feels like a rickety ladder of incentive compatibility problems. I agree with Ust’s characterization of warehouses as staid and that relatively little has happened since the Kimball model came to prominence, but this sounds like taking the problems of the Kimball model (slow to change, an iceberg of code & ETL, etc.) and effectively saying “Hey, here are slightly different views of the source systems, have fun” instead. That way, everybody gets to experience the joy of non-conforming dimensions, disagreeing facts, working with data in different grains, and seven terms for the same thing.

Comments closed