Warehousing – Page 3

Data Security in Snowflake

Published 2025-03-05 by Kevin Feasel

Anil Kumar Moka locks down a Snowflake instance:

In this practical guide, we’ll explore techniques to help you secure your use of Snowflake:

Foundational Security Setup

Secure Views and Their Critical Role

Row-Level Security Implementation Methods

Dynamic Data Masking Strategies

Encryption and Data Protection

Best Practices and Common Pitfalls

Read on for the full article.

Comments closed

Notes on Change Tracking for Warehouse Incremental Loads

Published 2025-01-31 by Kevin Feasel

Meagan Longoria shares some hard-earned experience:

I have a few clients that incrementally load tables from a SQL Server source into their data warehouse or lakehouse by using change tracking. Lately, they encountered some issues with changes to the configuration and the data in the source database, so I decided to share some things you can check before using change tracking as part of your ETL load or when troubleshooting your data load.

Click through for three common issues you may run into while using change tracking.

Comments closed

Creating a Microsoft Fabric Warehouse with Service Principal

Published 2025-01-16 by Kevin Feasel

Gilbert Quevauvilliers sets up a new warehouse:

In this blog post I am going to show you how to create a Microsoft Fabric Warehouse, where the owner will be the Service Principal.

As mentioned in the blog post here are some of the advantages of having the Service Principal as the Warehouse Owner.

Using a Service Principal to create the warehouse avoids issue where the person who created the warehouse leaves the organization and issues arise when the users account is deleted from Entra ID.

You avoid the painful logging in with the user account to ensure the password remains updated.

The organization now owns the warehouse and not an individual user.

I will show you how I created a Warehouse with the owner being a Service Principal this using a Microsoft Fabric Notebook

Click through for the notebook and additional commentary.

Comments closed

Custom SCD2 with PySpark

Published 2025-01-14 by Kevin Feasel

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.

Read on for an implementation in Python.

Comments closed

Thoughts on Primary and Foreign Key Constraints

Published 2025-01-14 by Kevin Feasel

Rob Farley lays out an argument:

I am NOT suggesting that data integrity is irrelevant. Not at all. But how often do we need an enforced primary key or foreign key?

Be warned – I’m not actually going to come to a conclusion in this post. I’m going to make cases for both sides, and let you choose where you stand. The T-SQL Tuesday topic this month is Integrity, so some other people might have written on a similar topic, and produce even more compelling arguments one way or another. I’m the host this time, so you’ll be able to find the round-up (once it’s there) here on the LobsterPot Solutions site too.

I will come to a conclusion and it is that OLTP systems need primary and foreign key constraints to work properly. In the post, Rob asks a question around the last time I saw a key violation error in my application. The good(?) news is that I have plenty of them in the last application I built on SQL Server, because I need to rely on a source system that dumps data and doesn’t actually check to see if existing records were there already. That means I can’t simply perform an inner join from my table to the source table, because I could get multiple records back. No, instead, I need to use a common table expression or APPLY operator, retrieve the max values from the flotsam and jetsam that exists, and make my code harder to follow and perform worse as a result.

Distributed warehousing systems don’t have enforceable keys because of the technical challenge of enforcing keys without having different nodes talk to each other. But these things also assume either that you’ve pre-validated all of the data (like in a Kimball model), that you don’t care about duplicate records or messiness, or that you’ll fix the problem again somewhere downstream. Which, in the case of Microsoft Fabric, is typically necessary by the time you put the data into a semantic model, as those things really don’t like duplicate records and this tends to mess up relationships between tables.

Comments closed

Creating and Working with an Azure Databricks SQL Warehouse

Published 2025-01-10 by Kevin Feasel

John Miner works a shift in the warehouse:

Many companies are leveraging data lakes to manage both structured and unstructured data. However, not all users are familiar with Python and the PySpark module. How can users with a solid understanding of ANSI SQL be effective in the Databricks environment?

Read on for the answer.

Comments closed

Cloning a Table in Snowflake

Published 2025-01-08 by Kevin Feasel

Kevin Wilkie creates a clone:

Recently, I was asked to compare data from a table in Snowflake with the same table’s data from a few hours before. As always, no one had thought about copying the data into another table before they started working with it. They just remembered an approximate time they started working with the table — 10 AM — and the table’s name — Public.WorkingTable. So, let’s see what we can do, shall we?

Read on for the process, as well as circumstances in which cloning might fail.

Comments closed

Data Lakes, Warehouses, and Lakehouses

Published 2025-01-06 by Kevin Feasel

Noa Shavit disambiguates three terms:

A data warehouse is a repository and platform for storing, querying, and manipulating data. Warehouses are particularly suited for structured data used for decision support and business intelligence. Modern data warehouses have become more efficient, flexible, and scalable (particularly in the context of massively parallel processing and distributed computation), but they still bear the mark of their early development in the previous century.

The data warehouse concept dates back to data marts in the 1970s. After a long incubation period, the idea began to bear fruit commercially at IBM in the late 1980s and early 1990s. Data warehousing improved on the inefficiency of data marts, siloed data stores maintained by individual departments.

Click through to learn more about each of the three concepts and how they relate.

Comments closed

Cloning Schemas in Snowflake

Published 2024-12-19 by Kevin Feasel

Kevin Wilkie makes a smaller clone:

One table at a time. Now I know what you’re thinking “Copying some of my databases would take forever…”

Well, don’t fret, my friends! Snowflake has a way!

Click through to see how you can clone an entire schema, including relevant tables, views, functions, procedures, etc. Kevin also specifies what types of objects do not copy, so check that out.

Comments closed

Table Cloning in Snowflake

Published 2024-11-27 by Kevin Feasel

Kevin Wilkie creates a clone:

In this coding scenario, I am copying everything from TableA and pushing it into a new table called TableB in the same database and schema.

If you check the size of the database before and after you clone a table, it will be the same size – no matter the size of TableA. This is because, at this point in time, TableB exists only as a “pointer” to the data that constitutes TableA. It is not until something changes in one of the tables – say adding a row to TableA, that it stops being a “pointer” and is artificially constituted.

Read on to learn more about how this works.

Comments closed

Category: Warehousing