Press "Enter" to skip to content

Category: Warehousing

Recursive Common Table Expressions in Snowflake

Kevin Wilkie is too fancy for simple joins:

Today, I want to talk about that fun edge case when you have to join a table to itself in Snowflake. Does it happen often? Not unless your architect just hates you.

Let’s use the normal pieces of data that everyone uses for this kind of thing – employee/manager relationships. We’ll play with the employee table we’ve been working from throughout this series.

The syntax is a bit different from T-SQL, but the concept is still the same.
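That difference is mostly the RECURSIVE keyword, which Snowflake wants on the WITH clause and T-SQL omits. A minimal sketch of the pattern, assuming a hypothetical EMPLOYEES table with EMPLOYEE_ID, EMPLOYEE_NAME, and MANAGER_ID columns (not Kevin’s exact setup):

WITH RECURSIVE org_chart AS (
    -- anchor member: employees with no manager sit at the top of the hierarchy
    SELECT employee_id, employee_name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- recursive member: attach each remaining employee to a manager row found so far
    SELECT e.employee_id, e.employee_name, e.manager_id, o.depth + 1
    FROM employees e
    INNER JOIN org_chart o
        ON e.manager_id = o.employee_id
)
SELECT employee_id, employee_name, depth
FROM org_chart
ORDER BY depth, employee_name;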


Against Keys in Fact Tables

Marc Lelijveld searches for keys under the lamppost:

Another blog post based on recent client experiences. Last week, I visited a client where we had extensive discussions on data model optimization. As you might know, data modeling in Power BI is one of my favorite topics, so I had an excellent day. It’s also not the first time I’ve blogged about data modeling and optimization. If you haven’t read it yet, I recommend reading my previous blog on this topic.

This blog will focus on the need for keys in your tables, primarily the fact tables in your data model. I keep running into customer data models that are flooded with keys in all tables. For each of them, you should ask: do I really need this, and could I store it in a different data type for further optimization? In this blog, I will further elaborate on keys in your data model, typical use cases, and how these cases can be solved in different manners.

Read the whole thing. The really short version is classic Kimball-style advice: keys for dimensions, not for facts. And in Power BI, removing a unique column from a fact table can speed things up by shrinking the compressed fact table size.
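As a sketch of what that advice looks like in a warehouse (all names here are hypothetical), a lean fact table carries only the dimension foreign keys and the measures:

CREATE TABLE fact_sales (
    date_key      INT NOT NULL,       -- FK to dim_date
    product_key   INT NOT NULL,       -- FK to dim_product
    customer_key  INT NOT NULL,       -- FK to dim_customer
    quantity      INT,
    sales_amount  DECIMAL(18, 2)
    -- no fact_sales_id surrogate key: a column that is unique per row
    -- compresses poorly and supports no relationship or filter
);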


NTILE and QUALIFY in Snowflake

Kevin Wilkie continues to build a sample:

In our last post, we went over one way to get a sample of data. In the end, was it right? Heck, no – at least not if we wanted a percentage of rows returned. Now, SAMPLE does work fine if you want a specific number of random rows returned each time.

But, let’s face it – sometimes we will be asked for 10% of a table – especially in this world of Machine Learning. So let’s attempt to find a way to make this happen, shall we?

QUALIFY would be a nice clause to have in T-SQL. There, you’re probably used to creating a common table expression or subquery and then applying a WHERE clause to the output of that CTE/subquery.
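As a rough sketch of the Snowflake side (MY_TABLE is a stand-in name), NTILE deals the rows into ten equal buckets and QUALIFY keeps one of them, which yields an exact 10% of the table, give or take a row:

-- deal rows into 10 equal buckets in random order, then keep bucket 1
SELECT *
FROM my_table
QUALIFY NTILE(10) OVER (ORDER BY RANDOM()) = 1;

The T-SQL equivalent wraps the NTILE in a CTE or subquery and filters it with WHERE, since QUALIFY isn’t available there.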


Slowly-Changing Dimensions in the Serverless SQL Pool

Lilliam Leme is building a serverless warehouse:

As organizations continue to collect and store large volumes of data in their data lakes, managing this data effectively becomes increasingly important. One key aspect of this is implementing Slowly Changing Dimension type 2, which allows organizations to track historical data by creating multiple records for a given natural key in the dimensional tables, with separate surrogate keys and/or different version numbers. In this blog post, we will address the following scenario: a customer wants to implement Slowly Changing Dimension type 2 on top of their data lake.

For this example, we will use Serverless SQL Pool to demonstrate how this can be done. Additionally, in the next post, we will explore how the same approach can be used with Spark.

This turns out to be more work than a classic SQL Server-based solution because the serverless SQL pool is read-only, save for CETAS statements.
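To make that constraint concrete, here is a hedged sketch of the CETAS pattern (the object names, the tracked CITY column, and the data source and file format are all assumptions, not Lilliam’s code). Since the pool can’t UPDATE files in the lake, each run materializes a complete new copy of the dimension into a new folder:

CREATE EXTERNAL TABLE dim_customer_v2
WITH (
    LOCATION    = 'warehouse/dim_customer/v2/',
    DATA_SOURCE = lake_ds,        -- assumed external data source
    FILE_FORMAT = parquet_ff      -- assumed Parquet file format
)
AS
-- existing rows: close out the current version when the city changed
SELECT d.customer_id, d.city, d.valid_from,
       CASE WHEN s.customer_id IS NOT NULL
            THEN CAST(GETDATE() AS date)
            ELSE d.valid_to
       END AS valid_to
FROM dim_customer d
LEFT JOIN stg_customer s
       ON s.customer_id = d.customer_id
      AND d.valid_to IS NULL
      AND s.city <> d.city
UNION ALL
-- changed or brand-new customers: open a fresh current version
SELECT s.customer_id, s.city, CAST(GETDATE() AS date), NULL
FROM stg_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id
      AND d.valid_to IS NULL
      AND d.city = s.city
WHERE d.customer_id IS NULL;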


Sampling and Inconsistent Result Counts

Kevin Wilkie does the math:

One of the things you may have noticed after reading our last post on Top (found here) is that sometimes SAMPLE doesn’t give the answer you want.

For example, we can run the same query to get 20% of the table. Remember that this table has 290 rows in total.

After seeing two runs return 69 and then 50 rows, respectively, Kevin digs in and finds out why. This got me thinking about whether a one-pass scan, assigning values based on a uniform distribution (which sounds like what is happening here) would be faster than random sampling without replacement over an array of 8-byte pointers, but then I realized that it’s way too early in the morning for me to be thinking architecture.
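For reference, the behavior he’s describing looks like this in Snowflake (MY_TABLE is a placeholder). Percentage-based sampling flips a 20% coin per row, so on a 290-row table you’d expect about 58 rows, with run-to-run spread like the 69 and 50 above, whereas row-count sampling is exact:

-- probabilistic: each row has a 20% chance of being kept, so the count varies
SELECT * FROM my_table SAMPLE BERNOULLI (20);

-- fixed-size: always returns exactly 58 rows (or fewer if the table is smaller)
SELECT * FROM my_table SAMPLE (58 ROWS);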


Snowflake Data Governance

Enrique Lopez de Lara shares a few ways that Snowflake allows us to protect data in its system:

The role hierarchy in the previous section defines what can be done on different objects and by whom. However, it doesn’t restrict which records within a table a user can see or which values should be masked within a column. That’s where the data governance policies in this section come into play.

All data governance policies and tags are stored in the PROD_DB_GOV database under three schemas: MASKING, ROWACCESS and TAGS. Putting all the policies and tags in a single database allows us to centralize them and better restrict access to them. Please note that only the GOV_ADMIN role has read/write permissions on it.

These are, for the most part, very similar to what we’re used to in relational databases: application and system roles, row-level security, and data classification.
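As a hedged illustration of the pattern (everything beyond the PROD_DB_GOV database and its schemas is an assumed name), a masking policy and a row access policy in Snowflake look roughly like this:

-- mask email addresses for everyone except the governance role
CREATE MASKING POLICY prod_db_gov.masking.mask_email
  AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'GOV_ADMIN' THEN val
       ELSE '*** MASKED ***'
  END;

ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY prod_db_gov.masking.mask_email;

-- row-level security driven by a role-to-region mapping table
CREATE ROW ACCESS POLICY prod_db_gov.rowaccess.sales_by_region
  AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'GOV_ADMIN'
  OR EXISTS (SELECT 1
             FROM prod_db_gov.rowaccess.region_map m
             WHERE m.role_name = CURRENT_ROLE()
               AND m.region = region);

ALTER TABLE sales ADD ROW ACCESS POLICY prod_db_gov.rowaccess.sales_by_region ON (region);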


Retrieving Redshift Query History

Koen Verbeeck wants to see what you did last summer:

Because my Windows machine apparently decides to install updates overnight (and thus reboot), I have lost queries I was writing for Redshift in the tool DBeaver. When you work with SQL Server Management Studio (SSMS), you typically don’t have this issue, as a temporary copy is always saved. Close down SSMS, restart it, and the queries are still there.

Click through to see what you can do.
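One recovery angle (my assumption rather than necessarily Koen’s approach): Redshift keeps recent statement text in system tables, so a lost query can often be fished out of STL_QUERY, which retains a few days of history:

-- recent queries for the current user, newest first
-- (querytxt is truncated at 4,000 characters; join STL_QUERYTEXT for the full text)
SELECT starttime, TRIM(querytxt) AS query_text
FROM stl_query
WHERE userid = CURRENT_USER_ID
ORDER BY starttime DESC
LIMIT 50;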


Checklist for a Snowflake Migration

Sandeep Arora has a checklist for us:

We have broken our Snowflake Migration Checklist into nine phases to help plan and execute an end-to-end migration of an existing traditional data platform to Snowflake. These phases will help align migration resources and efforts; however, this doesn’t necessarily mean that all steps should be executed sequentially. Some phases, like “Train Users,” can be executed in parallel with other phases.

At a high level, the process isn’t Snowflake-specific: really, six of the nine steps are generic supporting steps that would apply to any major project. This makes the checklist a good starting point not only for a Snowflake migration but for any major migration project.
