Press "Enter" to skip to content

Month: January 2025

Creating a Microsoft Fabric Warehouse with Service Principal

Gilbert Quevauvilliers sets up a new warehouse:

In this blog post I am going to show you how to create a Microsoft Fabric Warehouse, where the owner will be the Service Principal.

As mentioned in the blog post, here are some of the advantages of having the Service Principal as the Warehouse Owner.

  • Using a Service Principal to create the warehouse avoids the issue where the person who created the warehouse leaves the organization and problems arise when the user's account is deleted from Entra ID.
  • You avoid the pain of logging in with the user account just to keep the password updated.
  • The organization now owns the warehouse and not an individual user.

I will show you how I created a Warehouse with the owner being a Service Principal, using a Microsoft Fabric Notebook.

Click through for the notebook and additional commentary.
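If you want a feel for the overall pattern before clicking through: the notebook authenticates as the service principal and calls the Fabric REST API to create the warehouse. Below is a minimal sketch of that idea, not Gilbert's notebook itself; the tenant, app, and workspace values are placeholders, and the endpoint and payload follow the public create-warehouse REST operation.

# Hedged sketch: create a Fabric warehouse while authenticated as a service principal.
# Placeholders throughout -- the service principal must already have access to the workspace.
import requests
from azure.identity import ClientSecretCredential

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<app-client-id>"
CLIENT_SECRET = "<app-client-secret>"
WORKSPACE_ID = "<workspace-guid>"

# Token for the Fabric API, issued to the service principal rather than a user
credential = ClientSecretCredential(TENANT_ID, CLIENT_ID, CLIENT_SECRET)
token = credential.get_token("https://api.fabric.microsoft.com/.default").token

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/warehouses",
    headers={"Authorization": f"Bearer {token}"},
    json={"displayName": "WH_ServicePrincipalOwned",
          "description": "Warehouse created, and therefore owned, by a service principal"},
)
resp.raise_for_status()
print(resp.status_code, resp.text)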


2025 Data Professional Salary Survey Results

Brent Ozar shares this year’s survey results:

We’ve been running our annual Data Professional Salary Survey for almost a decade, and I was really curious to see what the results would hold this year. How would inflation and layoffs impact the database world? Download the raw data here and slice & dice it to see what’s important to you. Here’s what I found.

Read on for Brent’s analysis and grab the data for yourself to try things out. I’ve used this dataset in the past for presentations and it usually goes over pretty well, especially because it includes quite a few real-life data quality challenges.
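If you do grab the spreadsheet, a few lines of pandas are enough to start slicing. This is only a sketch: the file name and column names below are assumptions based on prior years, so check them against the actual download.

# Hedged sketch: poke at the salary survey data. File and column names are assumptions.
import pandas as pd

df = pd.read_excel("2025_Data_Professional_Salary_Survey_Responses.xlsx")

# A taste of the real-life data quality challenges: salaries are self-reported,
# so coerce to numeric and trim obviously implausible values before analyzing.
df["SalaryUSD"] = pd.to_numeric(df["SalaryUSD"], errors="coerce")
df = df[df["SalaryUSD"].between(10_000, 2_000_000)]

print(df.groupby("Country")["SalaryUSD"].median().sort_values(ascending=False).head(10))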


Working with DBCC OPTIMIZER_WHATIF

Yvonne Vanslageren shows off a very uncommon DBCC command:

DBCC OPTIMIZER_WHATIF is a powerful diagnostic command in SQL Server that allows database administrators and developers to simulate various hardware configurations. By doing so, it shows how the query optimizer would behave under different CPU and memory allocations—without requiring you to physically change server hardware. This makes it an invaluable tool for performance tuning, troubleshooting, and strategic planning in SQL Server environments.

This obviously won’t actually make things faster, but it will help you answer questions like “would more CPU cores improve this query, or do I need to dump more RAM into the server?”
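For a sense of the mechanics: the command is undocumented and unsupported, so keep it to test instances. Here is a hedged sketch of a session driven from Python via pyodbc; the property names (CPUs, MemoryMBs, ResetAll) come from community write-ups rather than official documentation.

# Hedged sketch: let the optimizer pretend the server has different hardware.
# DBCC OPTIMIZER_WHATIF is undocumented and unsupported -- test instances only.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=test-sql;DATABASE=AdventureWorks;"
    "Trusted_Connection=yes;TrustServerCertificate=yes",
    autocommit=True,
)
cur = conn.cursor()

# Pretend this session's server has 32 CPUs and 256 GB of memory
cur.execute("DBCC OPTIMIZER_WHATIF(CPUs, 32);")
cur.execute("DBCC OPTIMIZER_WHATIF(MemoryMBs, 262144);")

# Compile the query of interest here (for example, capture its estimated plan)
# and compare it with the plan produced under the real configuration.

# Put the session back the way it was
cur.execute("DBCC OPTIMIZER_WHATIF(ResetAll);")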


Temp Table Performance in PostgreSQL

Jobin Augustine shares a warning:

PostgreSQL is one of the most powerful database systems in the world. I have always been passionate about its great power, especially its modern SQL language features.

However, that doesn’t mean everything is great. There are areas where it hurts. Novice users unaware of the problem might get into trouble, and I see such things happen very often. This is a note for those who design their solutions around PostgreSQL using temporary tables.

This is very interesting to read, especially in comparison to SQL Server. It’s another example of the adage that just because something works a certain way on one platform, it won’t necessarily work the same way on another.
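For reference, the pattern being warned about looks something like the sketch below (a minimal psycopg2 example with made-up names). The temp table itself is perfectly valid SQL; the cost of doing this at high frequency is what the post digs into.

# Hedged sketch: the session-scoped temp table pattern under discussion (names made up).
# Each CREATE TEMP TABLE adds rows to the system catalogs for the life of the table,
# which is part of why doing this per transaction at high volume can hurt.
import psycopg2

conn = psycopg2.connect("dbname=app user=app_user host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TEMP TABLE staging_orders (
            order_id    bigint,
            customer_id bigint,
            amount      numeric(12,2)
        ) ON COMMIT DROP;
    """)
    cur.execute("INSERT INTO staging_orders VALUES (1, 42, 19.99);")
    cur.execute("SELECT count(*) FROM staging_orders;")
    print(cur.fetchone()[0])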


Automating DBCC CHECKDB on Full Backups

Shane O’Neill riffs off of a script:

Like many shops out there, we can’t run corruption checking on our main production database instance. So, then, what do we do? We take the backups and restore them to a test instance, and then run corruption checking on those restored databases.

At least this way we can test that the backups we take can be restored, as well.

But, I don’t want to spend every day manually restoring and corruption checking these databases, so let’s automate this bit…

Click through for a script that Shane built off of a Madeira Data Solutions script that Eitan Blumin put together.
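The shape of the work is simple enough to sketch, though: restore the latest full backup onto the test instance, run DBCC CHECKDB against the restored copy, and fail loudly on any errors. Here is a hedged Python/pyodbc outline with placeholder database names and paths; the linked script is the real implementation.

# Hedged sketch: restore a full backup to a test instance, then corruption-check the copy.
# Database names and paths are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=test-sql;DATABASE=master;"
    "Trusted_Connection=yes;TrustServerCertificate=yes",
    autocommit=True,   # RESTORE and DBCC cannot run inside a user transaction
)
cur = conn.cursor()

cur.execute(r"""
    RESTORE DATABASE [SalesDB_Check]
    FROM DISK = N'\\backupshare\SalesDB\SalesDB_Full.bak'
    WITH MOVE N'SalesDB'     TO N'D:\Data\SalesDB_Check.mdf',
         MOVE N'SalesDB_log' TO N'L:\Log\SalesDB_Check.ldf',
         REPLACE, STATS = 10;
""")
while cur.nextset():   # drain the restore's progress messages so it runs to completion
    pass

# NO_INFOMSGS keeps the output to genuine errors; severe errors raise a pyodbc exception
cur.execute("DBCC CHECKDB (N'SalesDB_Check') WITH NO_INFOMSGS;")
print("CHECKDB completed with no reported corruption")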


It’s Probably Not Data Corruption on Disk

Andy Yun talks storage:

I cannot tell you how many times I’ve encountered scenarios where “this data looks wrong.” Well… can one ensure that it is being retrieved and displayed correctly from the storage media that it resides on in the first place? Are you viewing/validating the data in question correctly? Whatever client/method you are using to review your data – that is suspect and its integrity is in question.

It is technically possible for bits to flip, but that’s also why we have checksums on disk. I’m sure there are people who have experienced storage corruption that changed just enough to cause problems but not enough to be noticeable, but Andy is right on the money.
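If you want to confirm that safety net is in place on your own instances, checking each database's page verify setting is a quick start, since CHECKSUM is the option that writes and validates a per-page checksum. A small sketch, with a placeholder server name:

# Hedged sketch: confirm page checksums are enabled before suspecting the storage layer.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=prod-sql;DATABASE=master;"
    "Trusted_Connection=yes;TrustServerCertificate=yes"
)
cur = conn.cursor()

for name, verify in cur.execute(
    "SELECT name, page_verify_option_desc FROM sys.databases ORDER BY name;"
):
    note = "" if verify == "CHECKSUM" else "  <-- consider switching to CHECKSUM"
    print(f"{name}: {verify}{note}")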


Thoughts on Data Integrity

Deborah Melkin shares some thoughts:

The first way to think of data integrity is a very small and literal interpretation. This is making sure that our data in the database is good. In many ways, these are easy to enforce – you add constraints. Primary Keys ensure that you know what makes each row unique. Unique constraints represent what would make each record unique if the primary key constraint, which is often a surrogate key these days, didn’t exist or offer different options. 

Read on for more about database design, default constraints, and a dive into data modeling.
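To make the constraint side concrete, here is a small sketch on a hypothetical table: a surrogate primary key, a unique constraint guarding the natural key, and a default constraint supplying a value the application doesn't have to.

# Hedged sketch: the constraint types under discussion, on a hypothetical table.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=dev-sql;DATABASE=Sandbox;"
    "Trusted_Connection=yes;TrustServerCertificate=yes",
    autocommit=True,
)
conn.execute("""
    CREATE TABLE dbo.Customer (
        CustomerID   int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Customer PRIMARY KEY,            -- surrogate key
        CustomerCode varchar(20) NOT NULL
            CONSTRAINT UQ_Customer_CustomerCode UNIQUE,    -- natural key stays unique
        CustomerName nvarchar(200) NOT NULL,
        CreatedUtc   datetime2(3) NOT NULL
            CONSTRAINT DF_Customer_CreatedUtc DEFAULT (sysutcdatetime())
    );
""")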


Custom SCD2 with PySpark

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.

Read on for an implementation in Python.
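The core of any type-2 load is the same two moves: expire the current row when a tracked attribute changes, then insert the new version with fresh effective dates. Below is one common way to express that with Delta Lake's MERGE; the table and column names are invented, and Abhishek's implementation may differ in its details.

# Hedged sketch: a common SCD2 pattern using Delta Lake MERGE (names are invented).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dim = DeltaTable.forName(spark, "dw.dim_customer")
updates = spark.table("staging.customer_updates")

# Step 1: expire the current row when a tracked attribute has changed
(dim.alias("d")
    .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(
        condition="d.address <> u.address OR d.phone <> u.phone",
        set={"is_current": F.lit(False), "end_date": F.current_date()})
    .execute())

# Step 2: insert new current versions -- changed customers (just expired) plus brand-new ones
still_current = spark.table("dw.dim_customer").where("is_current = true").select("customer_id")
new_rows = (updates.join(still_current, "customer_id", "left_anti")
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True)))
new_rows.write.format("delta").mode("append").saveAsTable("dw.dim_customer")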


Data Masking in Azure Databricks

Rayis Imayev hides some information:

One way to protect sensitive information from end users in a database is through dynamic masking. In this process, the actual data is not altered; however, when the data is exposed or queried, the results are returned with modified values, or the actual values are replaced with special characters or notes indicating that the requested data is hidden for protection purposes.

In this blog, we will discuss a different approach to protecting data, where personally identifiable information (PII – a term you will frequently encounter when reading about data protection and data governance) is actually changed or updated in the database / persistent storage. This ensures that even if someone gains access to the data, nothing will be compromised. This is usually needed when refreshing a production database or dataset containing PII data elements to a lower environment. Your QA team will appreciate having a realistic data volume that resembles the production environment but with masked data.

Rayis goes into depth on the process. I could also recommend checking out the article on row filters and column masks for more information.
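For a flavor of the persistent approach (as opposed to dynamic masking), here is a minimal PySpark sketch with hypothetical catalog, table, and column names: hash what still needs to be joinable, and redact or coarsen the rest.

# Hedged sketch: persistently mask PII before shipping a production copy to a lower environment.
# Catalog, table, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.table("prod_catalog.sales.customers")

masked = (customers
    # hashing keeps the value joinable and groupable without exposing the original
    .withColumn("email", F.sha2(F.col("email"), 256))
    # redact free-text PII outright
    .withColumn("full_name", F.lit("***MASKED***"))
    # coarsen quasi-identifiers rather than dropping them entirely
    .withColumn("birth_date", F.trunc(F.col("birth_date"), "year")))

(masked.write
    .mode("overwrite")
    .saveAsTable("qa_catalog.sales.customers"))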
