Curated SQL Posts

Thoughts on Data Integrity

Deborah Melkin shares some thoughts:

The first way to think of data integrity is a very small and literal interpretation. This is making sure that our data in the database is good. In many ways, these are easy to enforce – you add constraints. Primary Keys ensure that you know what makes each row unique. Unique constraints represent what would make each record unique if the primary key constraint, which is often a surrogate key these days, didn’t exist or offer different options. 

Read on for more about database design, default constraints, and a dive into data modeling.
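As a minimal illustration of the constraints Deborah describes, a surrogate primary key plus a unique constraint on the natural key might look like this (the table and columns here are hypothetical):

    CREATE TABLE dbo.Customer
    (
        CustomerID    int IDENTITY(1,1) NOT NULL,
        CustomerEmail nvarchar(320)     NOT NULL,
        CustomerName  nvarchar(200)     NOT NULL,
        -- Surrogate primary key: what makes each row unique
        CONSTRAINT PK_Customer PRIMARY KEY (CustomerID),
        -- Unique constraint on the natural key, since the primary key is a surrogate
        CONSTRAINT UQ_Customer_Email UNIQUE (CustomerEmail)
    );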

Custom SCD2 with PySpark

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.

Read on for an implementation in Python.
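Abhishek's implementation is in PySpark; as a rough sketch of the same type-2 pattern in SQL terms (the staging and dimension tables here are hypothetical), the load expires the current row for anything that changed and then inserts the new version:

    -- Expire the current version of any customer whose tracked attributes changed
    UPDATE dim
    SET    EffectiveEndDate = s.LoadDate,
           IsCurrent = 0
    FROM   dbo.DimCustomer AS dim
    JOIN   stg.Customer    AS s
           ON s.CustomerKey = dim.CustomerKey
    WHERE  dim.IsCurrent = 1
      AND  (dim.Address <> s.Address OR dim.Segment <> s.Segment);

    -- Insert a new current version for changed and brand-new customers alike
    INSERT INTO dbo.DimCustomer
        (CustomerKey, Address, Segment, EffectiveStartDate, EffectiveEndDate, IsCurrent)
    SELECT s.CustomerKey, s.Address, s.Segment, s.LoadDate, '9999-12-31', 1
    FROM   stg.Customer AS s
    LEFT JOIN dbo.DimCustomer AS dim
           ON  dim.CustomerKey = s.CustomerKey
           AND dim.IsCurrent = 1
    WHERE  dim.CustomerKey IS NULL;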

Data Masking in Azure Databricks

Rayis Imayev hides some information:

One way to protect sensitive information from end users in a database is through dynamic masking. In this process, the actual data is not altered; however, when the data is exposed or queried, the results are returned with modified values, or the actual values are replaced with special characters or notes indicating that the requested data is hidden for protection purposes.

In this blog, we will discuss a different approach to protecting data, where personally identifiable information (PII – a term you will frequently encounter when reading about data protection and data governance) is actually changed or updated in the database / persistent storage. This ensures that even if someone gains access to the data, nothing will be compromised. This is usually needed when refreshing the production database or dataset containing PII data elements to a lower environment. Your QA team will appreciate having a realistic data volume that resembles the production environment, but with masked data.

Rayis goes into depth on the process. I could also recommend checking out the article on row filters and column masks for more information.
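As a small sketch of the static approach (in T-SQL terms rather than Rayis's Databricks implementation, with made-up columns and masking rules), the refresh to the lower environment persists scrambled values instead of the real PII:

    -- Permanently replace PII after restoring production data to a lower environment
    UPDATE dbo.Customer
    SET    FirstName   = 'Test',
           LastName    = CONCAT('User_', CustomerID),                -- unique but meaningless
           Email       = CONCAT('user', CustomerID, '@example.com'), -- keeps a valid format for app testing
           PhoneNumber = '555-0100',
           DateOfBirth = DATEFROMPARTS(YEAR(DateOfBirth), 1, 1);     -- keep the year, drop the exact date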

Optimizing Incremental Inventory Calculations in DAX

Marco Russo and Alberto Ferrari track account balances:

Computing an inventory level or an account balance at a given time is a common requirement for many reports. However, when the source data contains all the transactions since the initial zero state, the calculation requires a running sum from the beginning of the data history until the day considered. While easy to implement, a calculation like this can be extremely expensive depending on several factors: the number of cells to compute in the report, the data volume of the transactions, and the cardinality of the dimensions.

The usual approach to optimizing this type of calculation is to introduce a snapshot table that pre-calculates the value of each date for all the dimensions required. Because of the resulting data volume, this solution can be very expensive both in terms of processing time and in terms of resulting memory consumption. A tradeoff is to limit the cardinality of the time available for the snapshot, for example by creating a monthly or quarterly snapshot instead of a daily snapshot. However, this approach limits the analysis of inventory or balance amount trends, and it removes any detail below the snapshot cardinality.

This article shows how to implement a hybrid approach that minimizes the snapshot cost without losing analytical capabilities. This provides outstanding query performance for the reports.

To an extent, this replicates what accountants do with general ledger operations: after an accounting period ends, you summarize results. Then, you can start from the summarized end point instead of needing to read each journal entry. It also provides a convenient time to perform audits and ensure that the numbers are what they should be.
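In SQL terms (with hypothetical snapshot and transaction tables), the hybrid idea amounts to this: take the last period-end snapshot on or before the requested date, then add only the transactions since that snapshot rather than re-summing the entire history.

    DECLARE @AsOf date = '2025-08-31';

    WITH LastSnapshot AS
    (
        -- Most recent period-end snapshot on or before the requested date, per account
        SELECT AccountID, SnapshotDate, Balance,
               ROW_NUMBER() OVER (PARTITION BY AccountID ORDER BY SnapshotDate DESC) AS rn
        FROM   dbo.AccountSnapshot
        WHERE  SnapshotDate <= @AsOf
    )
    SELECT s.AccountID,
           -- Snapshot value plus only the transactions after the snapshot date
           s.Balance + COALESCE(SUM(t.Amount), 0) AS BalanceAsOf
    FROM   LastSnapshot AS s
    LEFT JOIN dbo.AccountTransaction AS t
           ON  t.AccountID = s.AccountID
           AND t.TransactionDate >  s.SnapshotDate
           AND t.TransactionDate <= @AsOf
    WHERE  s.rn = 1
    GROUP BY s.AccountID, s.Balance;

The article itself implements this in DAX over a Power BI model; the SQL above is only meant to show the arithmetic.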

Connecting a Web App to Azure SQL DB via Entra Managed Identity

Joey D’Antoni doesn’t have time to create a password:

Managed identities in Microsoft Entra have simplified authentication in Azure, particularly if all of your resources are in Azure. In this basic example, I’m going to walk you through connecting a Web App (aka App Service) running an API server to an Azure SQL Database. The best part? There are no passwords involved–let’s get started.

Read on for the process. It’s interesting to see how far identity-based security has developed in Azure over the years.
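On the database side, the pattern boils down to creating a contained user for the Web App's managed identity and granting it only the roles it needs (the identity name below is hypothetical); the application then connects with an Authentication=Active Directory Managed Identity setting in its connection string (with Microsoft.Data.SqlClient, for example) rather than a password.

    -- Run in the target Azure SQL database while connected as a Microsoft Entra admin
    CREATE USER [my-api-web-app] FROM EXTERNAL PROVIDER;

    -- Grant only what the API actually needs
    ALTER ROLE db_datareader ADD MEMBER [my-api-web-app];
    ALTER ROLE db_datawriter ADD MEMBER [my-api-web-app];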

Thoughts on Primary and Foreign Key Constraints

Rob Farley lays out an argument:

I am NOT suggesting that data integrity is irrelevant. Not at all. But how often do we need an enforced primary key or foreign key?

Be warned – I’m not actually going to come to a conclusion in this post. I’m going to make cases for both sides, and let you choose where you stand. The T-SQL Tuesday topic this month is Integrity, so some other people might have written on a similar topic, and produce even more compelling arguments one way or another. I’m the host this time, so you’ll be able to find the round-up (once it’s there) here on the LobsterPot Solutions site too.

I will come to a conclusion, and it is that OLTP systems need primary and foreign key constraints to work properly. In the post, Rob asks when I last saw a key violation error in my application. The good(?) news is that I have plenty of them in the last application I built on SQL Server, because I need to rely on a source system that dumps data and doesn’t actually check whether existing records were already there. That means I can’t simply perform an inner join from my table to the source table, because I could get multiple records back. No, instead, I need to use a common table expression or the APPLY operator, retrieve the max values from the flotsam and jetsam that exist, and make my code both harder to follow and slower as a result.
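That workaround looks something like the following (a hypothetical sketch, not the actual schema): without a key on the source, every lookup has to collapse the duplicates before it can join.

    -- What should be a simple join becomes a dedupe-then-join
    WITH DedupedSource AS
    (
        SELECT src.BusinessKey, src.SourceValue,
               ROW_NUMBER() OVER (PARTITION BY src.BusinessKey
                                  ORDER BY src.LoadDate DESC) AS rn
        FROM   staging.SourceDump AS src
    )
    SELECT t.BusinessKey, t.SomeColumn, d.SourceValue
    FROM   dbo.MyTable AS t
    JOIN   DedupedSource AS d
           ON  d.BusinessKey = t.BusinessKey
           AND d.rn = 1;   -- keep only the latest copy of each record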

Distributed warehousing systems don’t have enforceable keys because of the technical challenge of enforcing keys without having different nodes talk to each other. But these systems also assume either that you’ve pre-validated all of the data (as in a Kimball model), that you don’t care about duplicate records or messiness, or that you’ll fix the problem somewhere downstream. In the case of Microsoft Fabric, that downstream fix is typically necessary by the time you put the data into a semantic model, as semantic models really don’t like duplicate records and duplicates tend to mess up relationships between tables.

Changing the Source Lakehouse in a Power BI Deployment Pipeline

Chris Webb makes a switch:

If you’re using deployment pipelines with Direct Lake semantic models in Power BI, you’ll have found that when you deploy your model from one stage to another, by default the model still points to the Lakehouse it was originally bound to. So, for example, if you deploy your model from your Development stage to your Test stage, the model in the Test stage still points to the Lakehouse in the Development stage. The good news is that you can use the deployment rules feature of deployment pipelines to make sure the model in the Test stage points to a Lakehouse in the Test stage, and in this post I’ll show you how.

Click through for the process.

Thoughts on Dogma and Power BI

Eugene Meidinger wants you to leave Plato’s cave:

I continue to be really frustrated about the dogmatic approach to Power BI. Best practices become religion, not to be questioned or elaborated on. Only to be followed. And you start to end up with these 10 Power BI modeling commandments:

Click through for those ten, followed by a valuable rant. This one’s a challenge for me because I understand where Eugene is going and agree. But there exists a fairly large subset of the population for whom Power BI (or whatever) isn’t a core part of the job and these people simply want an answer in order to complete a task and move on, not to gain a deeper understanding of the product. And it can be a challenge to differentiate these people from the people who would benefit from the more detailed explanation.

In short, you can drag a man out of Plato’s cave, but you can’t make him think.

Parquet File Customization and SQL Server

Ed Pollack writes some files:

Previously, we introduced and discussed the Parquet file format and SQL Server and why it is an ideal format for storing analytic data when it does not already reside in a native analytic data store, such as a data lake, data warehouse, or an Azure managed service.

Both Python and the Parquet file format are quite flexible, allowing for significant customization to ensure that file-related tasks are as optimal as possible. Compatibility with other processes, as well as keeping file sizes and properties under control, will also be introduced here.

Click through for some examples.
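On the consuming side, one quick way to see what those files contain from a SQL endpoint (assuming an Azure Synapse serverless SQL pool and a made-up storage path) is OPENROWSET with the PARQUET format:

    SELECT TOP 100 *
    FROM OPENROWSET(
        BULK   'https://mystorageaccount.dfs.core.windows.net/data/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales;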

Table-Level Locks in PostgreSQL

Gulcin Yildirim Jelinek takes us through locking in PostgreSQL:

All locking, whatever its type, will reduce throughput and potentially increase latency, which means a loss of performance, as nothing is ever free. If my intention is to make sure my data does not have corruption and that everyone gets a correct result at the time of their query, I have to accept that I’ll need to lock access when multiple transactions target the same table or the same row, taking a little time to keep things in order instead of showing wrong results, fast.

Read on for a quick primer on multi-version concurrency control, locking, and lock mechanisms in PostgreSQL.
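As a quick illustration of the tradeoff (with a hypothetical table), an explicit table-level lock serializes everyone else behind the transaction that holds it:

    BEGIN;

    -- ACCESS EXCLUSIVE conflicts with every other lock mode, so even plain
    -- SELECTs against this table will wait until the transaction commits
    LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE;

    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

    COMMIT;

Ordinary UPDATEs only take ROW EXCLUSIVE on the table plus locks on the affected rows, which is why MVCC normally lets readers carry on; an explicit ACCESS EXCLUSIVE lock is the heavyweight end of that spectrum.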
