Press "Enter" to skip to content

Category: Warehousing

Creating a Microsoft Fabric Warehouse with Service Principal

Gilbert Quevauvilliers sets up a new warehouse:

In this blog post I am going to show you how to create a Microsoft Fabric Warehouse, where the owner will be the Service Principal.

As mentioned in the blog post, here are some of the advantages of having the Service Principal as the Warehouse Owner:

  • Using a Service Principal to create the warehouse avoids the issue where the person who created the warehouse leaves the organization and problems arise when that user's account is deleted from Entra ID.
  • You avoid having to periodically log in with the user account just to keep the password updated.
  • The organization now owns the warehouse and not an individual user.

I will show you how I created a Warehouse with the owner being a Service Principal, using a Microsoft Fabric Notebook.

Click through for the notebook and additional commentary.
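For reference, here is a minimal sketch of the approach, assuming an app registration for the service principal and using the Fabric REST API's create-warehouse endpoint; the tenant, client, workspace, and warehouse values are placeholders, and Gilbert's notebook may do this differently.

    import msal
    import requests

    # Placeholder app registration and workspace values.
    TENANT_ID = "<tenant-id>"
    CLIENT_ID = "<client-id>"
    CLIENT_SECRET = "<client-secret>"
    WORKSPACE_ID = "<workspace-id>"

    # Authenticate as the service principal via the client credentials flow.
    app = msal.ConfidentialClientApplication(
        CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{TENANT_ID}",
        client_credential=CLIENT_SECRET,
    )
    token = app.acquire_token_for_client(
        scopes=["https://api.fabric.microsoft.com/.default"]
    )["access_token"]

    # Create the warehouse; because the token belongs to the service principal,
    # the service principal ends up as the item's owner.
    response = requests.post(
        f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/warehouses",
        headers={"Authorization": f"Bearer {token}"},
        json={"displayName": "SalesWarehouse"},
    )
    response.raise_for_status()  # creation may return 201, or 202 if long-running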


Custom SCD2 with PySpark

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were in at the time of the fact event.

Read on for an implementation in Python.
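Before clicking through, a self-contained sketch of the two core SCD2 moves in PySpark (expire the changed current row, then insert a new version), using invented customer columns rather than Abhishek's actual schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

    # Existing dimension: one row per customer version; "9999-12-31" marks open rows.
    dim = spark.createDataFrame(
        [(1, "C001", "Old Street", "2020-01-01", "9999-12-31", True)],
        ["sk", "customer_id", "address", "valid_from", "valid_to", "is_current"],
    )

    # Incoming snapshot with a changed address.
    src = spark.createDataFrame(
        [("C001", "New Street", "2024-06-01")],
        ["customer_id", "address", "load_date"],
    )

    d, s = dim.alias("d"), src.alias("s")

    # Current rows whose tracked attribute differs from the incoming value.
    changed = (
        d.filter("is_current")
        .join(s, F.col("d.customer_id") == F.col("s.customer_id"))
        .filter(F.col("d.address") != F.col("s.address"))
    )

    # Close out the old versions as of the load date.
    expired = changed.select(
        "d.sk", "d.customer_id", "d.address", "d.valid_from",
        F.col("s.load_date").alias("valid_to"),
        F.lit(False).alias("is_current"),
    )

    # Open new versions carrying the changed attribute.
    new_rows = changed.select(
        F.monotonically_increasing_id().alias("sk"),  # placeholder surrogate key
        F.col("s.customer_id").alias("customer_id"),
        F.col("s.address").alias("address"),
        F.col("s.load_date").alias("valid_from"),
        F.lit("9999-12-31").alias("valid_to"),
        F.lit(True).alias("is_current"),
    )

    # Everything else passes through untouched.
    unchanged = dim.join(changed.select(F.col("d.sk").alias("sk")), "sk", "left_anti")

    result = unchanged.unionByName(expired).unionByName(new_rows)
    result.orderBy("customer_id", "valid_from").show()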


Thoughts on Primary and Foreign Key Constraints

Rob Farley lays out an argument:

I am NOT suggesting that data integrity is irrelevant. Not at all. But how often do we need an enforced primary key or foreign key?

Be warned – I’m not actually going to come to a conclusion in this post. I’m going to make cases for both sides, and let you choose where you stand. The T-SQL Tuesday topic this month is Integrity, so some other people might have written on a similar topic, and produce even more compelling arguments one way or another. I’m the host this time, so you’ll be able to find the round-up (once it’s there) here on the LobsterPot Solutions site too.

I will come to a conclusion, and it is that OLTP systems need primary and foreign key constraints to work properly. In the post, Rob asks a question about the last time I saw a key violation error in my application. The good(?) news is that I have plenty of them in the last application I built on SQL Server, because I need to rely on a source system that dumps data and doesn’t actually check whether existing records were already there. That means I can’t simply perform an inner join from my table to the source table, because I could get multiple records back. No, instead, I need to use a common table expression or the APPLY operator, retrieve the max values from the flotsam and jetsam that exists, and make my code harder to follow and perform worse as a result.

Distributed warehousing systems don’t have enforceable keys because of the technical challenge of enforcing keys without having different nodes talk to each other. But these systems also assume either that you’ve pre-validated all of the data (as in a Kimball model), that you don’t care about duplicate records or messiness, or that you’ll fix the problem somewhere downstream. In the case of Microsoft Fabric, that downstream fix is typically necessary by the time you put the data into a semantic model, as semantic models really don’t like duplicate records, which tend to mess up relationships between tables.
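A hedged illustration of the workaround described above, with invented table and column names: without a key on the staging table, a per-key aggregate (here via CROSS APPLY) collapses the duplicates that a plain inner join would return.

    import pyodbc

    # Hypothetical connection string; the real server and database will differ.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )

    # staging.Orders has no enforced key, so OrderID may repeat; aggregate
    # per key instead of joining directly.
    sql = """
    SELECT o.OrderID, s.Amount, s.UpdatedAt
    FROM dbo.Orders AS o
    CROSS APPLY (
        SELECT MAX(src.Amount) AS Amount, MAX(src.UpdatedAt) AS UpdatedAt
        FROM staging.Orders AS src
        WHERE src.OrderID = o.OrderID
    ) AS s;
    """
    for row in conn.cursor().execute(sql):
        print(row.OrderID, row.Amount, row.UpdatedAt)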


Cloning a Table in Snowflake

Kevin Wilkie creates a clone:

Recently, I was asked to compare data from a table in Snowflake with the same table’s data from a few hours before. As always, no one had thought about copying the data into another table before they started working with it. They just remembered an approximate time they started working with the table — 10 AM — and the table’s name — Public.WorkingTable. So, let’s see what we can do, shall we?

Read on for the process, as well as circumstances in which cloning might fail.
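The core trick, sketched here with the Snowflake Python connector and placeholder connection details: CREATE TABLE ... CLONE plus Time Travel's AT clause recreates the table as it looked at roughly 10 AM, ready to compare against the current version.

    import snowflake.connector

    # Placeholder credentials.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="my_wh", database="my_db",
    )
    cur = conn.cursor()

    # Clone the table as it existed at 10 AM, via Time Travel.
    cur.execute("""
        CREATE TABLE Public.WorkingTable_10AM
        CLONE Public.WorkingTable
        AT (TIMESTAMP => '2024-06-01 10:00:00'::timestamp_ltz)
    """)

    # Rows that changed since then show up in a simple set difference.
    cur.execute("""
        SELECT * FROM Public.WorkingTable
        EXCEPT
        SELECT * FROM Public.WorkingTable_10AM
    """)
    print(cur.fetchall())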


Data Lakes, Warehouses, and Lakehouses

Noa Shavit disambiguates three terms:

A data warehouse is a repository and platform for storing, querying, and manipulating data. Warehouses are particularly suited for structured data used for decision support and business intelligence. Modern data warehouses have become more efficient, flexible, and scalable (particularly in the context of massively parallel processing and distributed computation), but they still bear the mark of their early development in the previous century.

The data warehouse concept dates back to data marts in the 1970s. After a long incubation period, the idea began to bear fruit commercially at IBM in the late 1980s and early 1990s. Data warehousing improved on the inefficiency of data marts, siloed data stores maintained by individual departments. 

Click through to learn more about each of the three concepts and how they relate.


Cloning Schemas in Snowflake

Kevin Wilkie makes a smaller clone:

One table at a time. Now I know what you’re thinking: “Copying some of my databases would take forever…”

Well, don’t fret, my friends! Snowflake has a way!

Click through to see how you can clone an entire schema, including relevant tables, views, functions, procedures, etc. Kevin also specifies what types of objects do not copy, so check that out.
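The statement at the heart of it is a one-liner; here is a sketch with hypothetical names (cloning is a metadata operation, so even a large schema clones quickly):

    import snowflake.connector

    # Placeholder credentials.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...", database="my_db"
    )

    # Clone the whole schema; tables, views, and other clonable objects
    # come along as zero-copy clones.
    conn.cursor().execute("CREATE SCHEMA my_db.Public_Clone CLONE my_db.Public")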


Table Cloning in Snowflake

Kevin Wilkie creates a clone:

In this coding scenario, I am copying everything from TableA and pushing it into a new table called TableB in the same database and schema.

If you check the size of the database before and after you clone a table, it will be the same size – no matter the size of TableA. This is because, at this point in time, TableB exists only as a “pointer” to the data that constitutes TableA. It is not until something changes in one of the tables – say, adding a row to TableA – that it stops being a “pointer” and the data is physically materialized.

Read on to learn more about how this works.
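A sketch of the behavior Kevin describes, with invented names: right after the clone, TableB only references TableA's micro-partitions; once either table changes, the new or rewritten partitions belong solely to the table that changed, and the storage metrics start to diverge.

    import snowflake.connector

    # Placeholder credentials.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...", database="my_db"
    )
    cur = conn.cursor()

    # Zero-copy clone: TableB initially just points at TableA's micro-partitions.
    cur.execute("CREATE TABLE Public.TableB CLONE Public.TableA")

    # Changing either table makes them diverge; this insert creates
    # micro-partitions owned by TableA alone.
    cur.execute("INSERT INTO Public.TableA VALUES (42)")

    # ACTIVE_BYTES shows how much storage each table owns itself
    # (note: this account_usage view can lag by an hour or two).
    cur.execute("""
        SELECT table_name, active_bytes
        FROM snowflake.account_usage.table_storage_metrics
        WHERE table_name IN ('TABLEA', 'TABLEB')
    """)
    print(cur.fetchall())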


E-Mailing Query Results in Snowflake

Kevin Wilkie gussies up an e-mail:

In our last post, we discussed the most basic way to send an email from Snowflake. It was pretty simple, straight text – nothing to really grab the attention of our readers – which we know is the way to craft an email, right?

To do this, we’re going to have some fun in Python. Yes, delve deep into your bag of Python tricks as we get up to some shenanigans with Snowflake and Python.

Read on for a procedure to e-mail the prior result set in HTML format.
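As a rough outline of the moving parts (not Kevin's actual procedure), assuming a notification integration named my_email_int already exists: a Snowpark Python procedure can rebuild a prior result set with RESULT_SCAN, render it as an HTML table via pandas, and pass it to SYSTEM$SEND_EMAIL with a text/html MIME type.

    from snowflake.snowpark import Session

    def email_results(session: Session, query_id: str, recipient: str) -> str:
        """Hypothetical stored-procedure handler: e-mail a prior result set as HTML."""
        # Rebuild the earlier result set from its query ID.
        df = session.sql(
            f"SELECT * FROM TABLE(RESULT_SCAN('{query_id}'))"
        ).to_pandas()

        # Render the rows as an HTML table.
        html = df.to_html(index=False)

        # SYSTEM$SEND_EMAIL takes the integration name, recipients, subject,
        # body, and an optional MIME type; session.call handles the quoting.
        session.call(
            "SYSTEM$SEND_EMAIL",
            "my_email_int",  # assumed notification integration
            recipient,
            "Query results",
            html,
            "text/html",
        )
        return "sent"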


Hyperthreading and Warehouses

Joey D’Antoni covers the impact of hyperthreading on warehouse performance:

Database performance, especially in the cloud, is always a big concern. Traditionally, cloud storage has not performed as well as high-end on-premises storage. The introduction of faster cloud storage (in Azure Premium V2 and Ultra Disk, in AWS IO1-3) and VMs with more available storage bandwidth has improved overall storage performance. However, storage is not always the complete picture.

Read on for Joey’s tests and findings.
