Warehousing – Page 11

Sampling and Inconsistent Result Counts

Published 2023-04-27 by Kevin Feasel

One of the things you may have noticed after reading our last post on Top (found here) is that sometimes SAMPLE doesn’t give the answer you want.

For example, we can run the same query to get 20% of the table. Remember that this table has 290 rows in total.

After seeing two runs return 69 and then 50 rows, respectively, Kevin digs in and finds out why. This got me thinking about whether a one-pass scan, assigning values based on a uniform distribution (which sounds like what is happening here) would be faster than random sampling without replacement over an array of 8-byte pointers, but then I realized that it’s way too early in the morning for me to be thinking architecture.

Comments closed

Snowflake Data Governance

Published 2023-04-24 by Kevin Feasel

Enrique Lopez de Lara shares a few ways that Snowflake allows us to protect data in its system:

The role hierarchy in the previous section defines what can be done on different objects and by whom. However, it doesn’t restrict which records within a table a user can see or which values should be masked within a column. That’s where the data governance policies in this section come into play.

All data governance policies and tags are stored in the PROD_DB_GOV database under three schemas: MASKING, ROWACCESS and TAGS. Putting all the policies and tags in a single database allows us to centralize them and better restrict access to them. Please note that only the GOV_ADMIN role has read/write permissions on it.

These are, for the most part, very similar to what we’re used to in relational databases: application and system roles, row-level security, and data classification.

Comments closed

Retrieving Redshift Query History

Published 2023-04-21 by Kevin Feasel

Koen Verbeeck wants to see what you did last summer:

Because my Windows machine apparently decides to install updates over night (and thus reboot my machine), it has happened that I lost the query that I was writing for Redshift in the tool DBeaver. When you work with SQL Server Management Studio (SSMS), you typically don’t have this issue as a temporary copy is always saved. Close down SSMS, restart it and the queries are still there.

Click through to see what you can do.

Comments closed

TOP in Snowflake

Published 2023-04-19 by Kevin Feasel

Kevin Wilkie covers the TOP operator and the SAMPLE function in Snowflake:

Nothing flashy or fancy about this. But notice in SQL Server, especially in the newer versions, it wants me to use paranthesis around the 100. I try that in Snowflake and I get this ugly message:

Bringing this back to SQL Server, yeah, use the parentheses around TOP, like TOP(25) instead of TOP 25. It’s better that way.

Comments closed

Checklist for a Snowflake Migration

Published 2023-04-12 by Kevin Feasel

Sandeep Arora has a checklist for us:

We have broken our Snowflake Migration Checklist into nine phases to help plan and execute an end-to-end migration of the existing traditional data platform to Snowflake. These phases will help align migration resources and efforts; however, this doesn’t necessarily mean that all steps should be executed sequentially. Some phases, like “Train Users,” can be executed parallel to other phases.

At a high level, the process isn’t Snowflake-specific—really, 6 of the 9 steps are generic supporting steps which would apply to any major project. This makes the checklist not only a good starting point for a Snowflake migration, but also any major migration project.

Comments closed

Unpivoting Data in Hive

Published 2023-04-04 by Kevin Feasel

The Big Data in Real World team does a bit of data reshaping:

Let’s say we have a table name employee_multiple_depts and each employee in the table is mapped to 3 departments – dept1, dept2 and dept3.

What we need is to transpose or convert each department into a row for each employee.

Click through to see how you can do it in Hive.

Comments closed

Tracking Change Events in Snowflake

Published 2023-03-31 by Kevin Feasel

Kevin Wilkie shows off an interesting window function:

Notice that it has the OVER operator, you can order the data, and even partition the data as needed (Not seen in this example)!

But, as usual with Snowflake, there are even more functions we can work with! Sometimes, you just need to know when items are changed. Enter the CONDITIONAL_CHANGE_EVENT windowing function!

Click through for an example of how CONDITIONAL_CHANGE_EVENT() works.

Comments closed

Join Operations in BigQuery

Published 2023-03-31 by Kevin Feasel

Rathish Kumar joins a few tables together:

SQL joins are used to combine columns from multiple tables to get desired result set. In a typical Relational model we use normalized tables, each table represents an entity (example: employee, department, etc) and its relationships and when we need to get data from more than one tables, for example employee name and employee department, we use joins to combine employee name column from employee table, department name column from department table based on employee number key column, which is available on both the tables.

Similarly, typical data warehouse setup follows Star or Snowflake schema consisting of a primary fact table and satellite dimension tables. Fact tables represents events (example: orders table in a ecommerce business) and dimension table represents attributes and slowly changing information (example: customer, product tables).

The syntax is rather similar to most database engines, though there are a few physical join operators which differ from typical relational database management systems. Also, I’ll take this moment to say thank you to Rathish for not using Venn diagrams to show joins and instead using a proper technique.

Comments closed

Landing Zone Layouts for Modern Data Warehouses

Published 2023-03-27 by Kevin Feasel

Paul Hernandez builds out a landing zone for a warehouse:

In this article I want to discuss some different layout options for a landing zone in a modern cloud data warehouse architecture. With landing zone, I mean a storage account where raw data lands directly from its source system (not to be confused with a landing zone to move a system or application into the cloud).

One of the things I appreciate a lot about this post is that it covers the history, showing us how we got to where we are. Paul’s well-versed in each step along the way and lays things out clearly.

Comments closed

End of Month in Snowflake and SQL Server

Published 2023-03-15 by Kevin Feasel

Kevin Wilkie is ready for that end-of-month paycheck:

When you work with data, you’ll probably need to work with dates at least once a month. That is the nature of the beast. Today, let’s compare working with them in SQL Server and Snowflake. I want to focus only on adding and subtracting months when provided with a specific day.

Along the way, I would also push for a calendar table, so that you can remove some of the more difficult (or even most common) date calculations.

Comments closed

Category: Warehousing