Warehousing – Page 11

Against Keys in Fact Tables

Published 2023-05-18 by Kevin Feasel

Marc Lelijveld searches for keys under the lamppost:

Another blog post based on recent client experiences. Last week, I visited a client where we had extensive discussions on data model optimization. As you might know, data modeling in Power BI is one of my favorite topics, so I had an excellent day. It’s also not the first time that I blog about anything data modeling and optimization. If you haven’t read it yet, I recommend reading my previous blog on this topic.

This blog will focus on the need for keys in your tables, primarily the fact tables in your data model. I keep running into data models at customers which are flooded with keys in all tables. For each of them you should ask: do I really need this, and could I save it in a different data type for further optimization? In this blog, I will further elaborate on keys in your data model, typical use cases and how these cases can be solved in different manners.

Read the whole thing. The really short version is classic Kimball-style advice: keys for dimensions, not for facts. And in Power BI, removing a unique column from a fact table can speed things up by shrinking the compressed fact table size.
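To see why a unique column is expensive in a compressed columnar store, here is a rough sketch in Python. It uses a deliberately crude model of dictionary encoding (one dictionary entry per distinct value plus a fixed-width index per row) and is not a description of how Power BI actually stores data. The function name, the 64-bit entry size, and the two sample columns are all assumptions made for illustration.

```python
import math

def dictionary_encoded_bits(values, bits_per_distinct_value=64):
    """Rough size estimate, in bits, for a dictionary-encoded column.

    Assumption: every distinct value costs a fixed number of bits in the
    dictionary, and every row stores a fixed-width index into it.
    """
    distinct = len(set(values))
    # Bits needed to address any dictionary entry (at least 1).
    index_bits = max(1, math.ceil(math.log2(distinct))) if distinct > 1 else 1
    return distinct * bits_per_distinct_value + len(values) * index_bits

rows = 100_000
unique_key = range(rows)                      # hypothetical fact-table key: one value per row
status_code = [i % 5 for i in range(rows)]    # hypothetical low-cardinality column

key_bits = dictionary_encoded_bits(unique_key)
status_bits = dictionary_encoded_bits(status_code)
print(f"unique key column:  {key_bits:,} bits")
print(f"5-value column:     {status_bits:,} bits")
```

Under this model the unique column needs a dictionary as large as the table itself, so it dominates the estimate, while the low-cardinality column costs only a few bits per row.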

NTILE and QUALIFY in Snowflake

Published 2023-05-10 by Kevin Feasel

Kevin Wilkie continues to build a sample:

In our last post, we went over one way to get a sample of data. In the end was it right? Heck, no – at least not if we wanted a percentage of rows returned. Now, SAMPLE does work fine if you want a specific number of random rows returned each time.

But, let’s face it – sometimes we will be asked for 10% of a table – especially in this world of Machine Learning. So let’s attempt to find a way to make this happen, shall we?

QUALIFY would be a nice clause to have in T-SQL. Without it, the usual approach is to build a common table expression or subquery and then apply a WHERE clause to the output of that CTE or subquery.
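The CTE-plus-WHERE pattern can be sketched with Python's built-in sqlite3 module, since SQLite supports NTILE but has no QUALIFY clause. This is not the code from the linked post: the table name sample_source, its single id column, and the 290-row size are assumptions chosen for illustration, and it needs a SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_source (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO sample_source (id) VALUES (?)",
    [(i,) for i in range(1, 291)],  # 290 hypothetical rows
)

# NTILE(10) splits the randomly ordered rows into ten equal buckets.
# Without QUALIFY, the window function lives in a CTE and the filter
# is applied in the outer query.
query = """
WITH tiled AS (
    SELECT id,
           NTILE(10) OVER (ORDER BY random()) AS tile
    FROM sample_source
)
SELECT id
FROM tiled
WHERE tile = 1
"""
sampled_ids = [row[0] for row in conn.execute(query)]
print(len(sampled_ids))  # 29, one tenth of 290 rows
```

Which rows come back changes from run to run, but the count does not, because NTILE assigns bucket sizes deterministically. In an engine that supports QUALIFY, the same filter can sit directly in the SELECT that computes the window function.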

Slowly-Changing Dimensions in the Serverless SQL Pool

Published 2023-05-05 by Kevin Feasel

Lilliam Leme is building a serverless warehouse:

As organizations continue to collect and store large volumes of data in their data lakes, managing this data effectively becomes increasingly important. One key aspect of this is implementing Slowly Changing Dimension type 2, which allows organizations to track historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. In this blog post we will address the following scenario: a customer wants to implement Slowly Changing Dimension type 2 on top of their data lake.

For this example, we will use Serverless SQL Pool to demonstrate how this can be done. Additionally, in the next post, we will explore how the same approach can be used with Spark.

This turns out to be more work than a classic SQL Server-based solution because the serverless SQL pool is read-only, save for CETAS statements.
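To make the type 2 idea concrete, here is a minimal Python sketch of the logic, not the serverless SQL pool implementation from the linked post. The CustomerVersion class, its fields, and the apply_change function are hypothetical names. The function returns a new list rather than updating rows in place, which loosely mirrors the constraint of only being able to write a fresh result set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class CustomerVersion:
    surrogate_key: int
    customer_id: str           # natural key
    city: str                  # the attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date]   # None marks the current version

def apply_change(dimension: List[CustomerVersion], customer_id: str,
                 city: str, change_date: date) -> List[CustomerVersion]:
    """Return a new dimension with a type 2 change applied."""
    current = next((r for r in dimension
                    if r.customer_id == customer_id and r.valid_to is None), None)
    if current is not None and current.city == city:
        return list(dimension)  # attribute unchanged, nothing to version
    next_key = max((r.surrogate_key for r in dimension), default=0) + 1
    result = []
    for r in dimension:
        if r is current:
            # Close out the old version instead of overwriting it.
            r = CustomerVersion(r.surrogate_key, r.customer_id, r.city,
                                r.valid_from, change_date)
        result.append(r)
    # The new version gets its own surrogate key and an open end date.
    result.append(CustomerVersion(next_key, customer_id, city, change_date, None))
    return result

dim = apply_change([], "C1", "Lisbon", date(2023, 1, 1))
dim = apply_change(dim, "C1", "Porto", date(2023, 5, 1))
for row in dim:
    print(row)
```

After the second call the dimension holds two rows for the same natural key: the Lisbon version closed on the change date and the Porto version still open.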

Sampling and Inconsistent Result Counts

Published 2023-04-27 by Kevin Feasel

Kevin Wilkie does the math:

One of the things you may have noticed after reading our last post on Top (found here) is that sometimes SAMPLE doesn’t give the answer you want.

For example, we can run the same query to get 20% of the table. Remember that this table has 290 rows in total.

After seeing two runs return 69 and then 50 rows, respectively, Kevin digs in and finds out why. This got me thinking about whether a one-pass scan, assigning values based on a uniform distribution (which sounds like what is happening here) would be faster than random sampling without replacement over an array of 8-byte pointers, but then I realized that it’s way too early in the morning for me to be thinking architecture.
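The difference between the two approaches can be sketched in Python. This assumes the behavior guessed at above (each row kept independently with a fixed probability) and is not Snowflake's implementation. The function names are hypothetical. The 290-row size and the 20% fraction come from the quoted example.

```python
import random

rows = list(range(290))
fraction = 0.20

def bernoulli_sample(items, p, rng):
    """One pass: keep each row independently with probability p."""
    return [x for x in items if rng.random() < p]

def fixed_size_sample(items, p, rng):
    """Sample without replacement, so the row count is always the same."""
    k = round(len(items) * p)
    return rng.sample(items, k)

rng = random.Random()
for run in range(3):
    varying = len(bernoulli_sample(rows, fraction, rng))
    fixed = len(fixed_size_sample(rows, fraction, rng))
    print(f"run {run}: per-row sampling kept {varying} rows, fixed-size kept {fixed}")
```

The per-row version averages 58 rows but wanders from run to run, while the fixed-size version returns exactly 58 rows every time.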

Snowflake Data Governance

Published 2023-04-24 by Kevin Feasel

Enrique Lopez de Lara shares a few ways that Snowflake allows us to protect data in its system:

The role hierarchy in the previous section defines what can be done on different objects and by whom. However, it doesn’t restrict which records within a table a user can see or which values should be masked within a column. That’s where the data governance policies in this section come into play.

All data governance policies and tags are stored in the PROD_DB_GOV database under three schemas: MASKING, ROWACCESS and TAGS. Putting all the policies and tags in a single database allows us to centralize them and better restrict access to them. Please note that only the GOV_ADMIN role has read/write permissions on it.

These are, for the most part, very similar to what we’re used to in relational databases: application and system roles, row-level security, and data classification.

Retrieving Redshift Query History

Published 2023-04-21 by Kevin Feasel

Koen Verbeeck wants to see what you did last summer:

Because my Windows machine apparently decides to install updates overnight (and thus reboot my machine), it has happened that I lost the query that I was writing for Redshift in the tool DBeaver. When you work with SQL Server Management Studio (SSMS), you typically don’t have this issue as a temporary copy is always saved. Close down SSMS, restart it and the queries are still there.

Click through to see what you can do.

TOP in Snowflake

Published 2023-04-19 by Kevin Feasel

Kevin Wilkie covers the TOP operator and the SAMPLE function in Snowflake:

Nothing flashy or fancy about this. But notice in SQL Server, especially in the newer versions, it wants me to use parentheses around the 100. I try that in Snowflake and I get this ugly message:

Bringing this back to SQL Server, yeah, use the parentheses around TOP, like TOP(25) instead of TOP 25. It’s better that way.

Checklist for a Snowflake Migration

Published 2023-04-12 by Kevin Feasel

Sandeep Arora has a checklist for us:

We have broken our Snowflake Migration Checklist into nine phases to help plan and execute an end-to-end migration of the existing traditional data platform to Snowflake. These phases will help align migration resources and efforts; however, this doesn’t necessarily mean that all steps should be executed sequentially. Some phases, like “Train Users,” can be executed parallel to other phases.

At a high level, the process isn’t Snowflake-specific: 6 of the 9 steps are generic supporting steps that would apply to any major project. This makes the checklist a good starting point not only for a Snowflake migration but also for any major migration project.

Unpivoting Data in Hive

Published 2023-04-04 by Kevin Feasel

The Big Data in Real World team does a bit of data reshaping:

Let’s say we have a table named employee_multiple_depts and each employee in the table is mapped to 3 departments – dept1, dept2 and dept3.

What we need is to transpose or convert each department into a row for each employee.

Click through to see how you can do it in Hive.
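To show the target shape of the reshaping, here is a small Python sketch. It is not the Hive query from the linked post. The dept1, dept2 and dept3 columns come from the quoted description, while the employee_id column, the sample values, and the unpivot function are assumptions made for illustration.

```python
# Hypothetical rows shaped like employee_multiple_depts.
employees = [
    {"employee_id": 1, "dept1": "Sales", "dept2": "Marketing", "dept3": "Support"},
    {"employee_id": 2, "dept1": "Finance", "dept2": "Legal", "dept3": "Audit"},
]

def unpivot(rows, id_column, value_columns, value_name):
    """Emit one output row per (input row, value column) pair."""
    return [
        {id_column: row[id_column], value_name: row[column]}
        for row in rows
        for column in value_columns
    ]

long_rows = unpivot(employees, "employee_id", ["dept1", "dept2", "dept3"], "dept")
for row in long_rows:
    print(row)
```

Each employee now appears three times, once per department, which is the row-per-department layout the post is after.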

Tracking Change Events in Snowflake

Published 2023-03-31 by Kevin Feasel

Kevin Wilkie shows off an interesting window function:

Notice that it has the OVER operator, you can order the data, and even partition the data as needed (Not seen in this example)!

But, as usual with Snowflake, there are even more functions we can work with! Sometimes, you just need to know when items are changed. Enter the CONDITIONAL_CHANGE_EVENT windowing function!

Click through for an example of how CONDITIONAL_CHANGE_EVENT() works.
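As a rough mental model, the function numbers the rows in a window so that the number goes up by one each time the tracked value differs from the previous row, starting from 0. The Python sketch below emulates that for a single partition that is already ordered. It ignores NULL handling and partitioning, and the function name and sample values are illustrative only.

```python
def conditional_change_event(values):
    """Emulate a change-event counter over one ordered window.

    The counter starts at 0 and increases by 1 whenever the current
    value differs from the value on the previous row.
    """
    event = 0
    result = []
    previous = None
    for position, value in enumerate(values):
        if position > 0 and value != previous:
            event += 1
        result.append(event)
        previous = value
    return result

# Hypothetical status readings, already ordered by time.
statuses = ["on", "on", "off", "off", "on"]
print(conditional_change_event(statuses))  # [0, 0, 1, 1, 2]
```

Rows that share an event number belong to the same unbroken run of values, which makes the result handy as a grouping key.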
