Data Modeling – Curated SQL

Thoughts on Data Integrity

Published 2025-01-15 by Kevin Feasel

The first way to think of data integrity is a very small and literal interpretation: making sure that the data in our database is good. In many ways, this is easy to enforce – you add constraints. Primary keys ensure that you know what makes each row unique. Unique constraints represent what would make each record unique if the primary key constraint, which is often a surrogate key these days, didn’t exist, or they offer alternative ways to identify a row.
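As a rough illustration of the constraint types mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. The customer table, its columns, and the try_insert helper are hypothetical names chosen for this example, and the SQL is SQLite's dialect rather than anything taken from the linked post.

```python
import sqlite3


def build_customer_table(conn):
    # Hypothetical table: a surrogate primary key, a unique natural key,
    # and a default plus check constraint on status.
    conn.execute(
        """
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            email       TEXT NOT NULL UNIQUE,
            status      TEXT NOT NULL DEFAULT 'active'
                        CHECK (status IN ('active', 'inactive'))
        )
        """
    )


def try_insert(conn, email, status=None):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        if status is None:
            # Omitting status lets the default constraint supply 'active'.
            conn.execute("INSERT INTO customer (email) VALUES (?)", (email,))
        else:
            conn.execute(
                "INSERT INTO customer (email, status) VALUES (?, ?)",
                (email, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
build_customer_table(conn)

results = [
    try_insert(conn, "a@example.com"),             # accepted
    try_insert(conn, "a@example.com"),             # rejected: duplicate email
    try_insert(conn, "b@example.com", "unknown"),  # rejected: fails the check
]
```

The surrogate key keeps joins simple, while the unique constraint on email is what actually stops the same customer from being entered twice.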

Read on for more about database design, default constraints, and a dive into data modeling.

Comments closed

Custom SCD2 with PySpark

Published 2025-01-14 by Kevin Feasel

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.
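To make the idea concrete, here is a minimal sketch of type-2 logic in plain Python rather than PySpark. It is not Abhishek's implementation: the apply_scd2 function and the customer_id, address, valid_from, valid_to, and is_current field names are assumptions made up for this example.

```python
from datetime import date


def apply_scd2(dimension, incoming, effective):
    """Return a new list of dimension rows after applying incoming changes.

    dimension: list of dicts with customer_id, address, valid_from,
               valid_to, and is_current (hypothetical column names).
    incoming:  dict mapping customer_id to the latest address.
    effective: date on which the incoming values take effect.
    """
    # Copy the rows so the caller's list is left untouched.
    result = [dict(row) for row in dimension]
    current = {row["customer_id"]: row for row in result if row["is_current"]}

    for customer_id, address in incoming.items():
        existing = current.get(customer_id)
        if existing is not None and existing["address"] == address:
            continue  # nothing changed, so keep the current row as-is
        if existing is not None:
            # Close out the old version instead of overwriting it.
            existing["valid_to"] = effective
            existing["is_current"] = False
        # Add the new version as the current row.
        result.append(
            {
                "customer_id": customer_id,
                "address": address,
                "valid_from": effective,
                "valid_to": None,
                "is_current": True,
            }
        )
    return result


dim = [
    {
        "customer_id": 1,
        "address": "Old St",
        "valid_from": date(2024, 1, 1),
        "valid_to": None,
        "is_current": True,
    }
]
updated = apply_scd2(dim, {1: "New Ave", 2: "Elm Rd"}, date(2025, 1, 1))
```

A fact dated before 2025-01-01 would still match the "Old St" row through its valid_from and valid_to range, which is the history-preserving behavior the excerpt describes.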

Read on for an implementation in Python.


Implementing Role-Playing Dimensions in Power BI

Published 2024-10-15 by Kevin Feasel

Teo Lachev puts on a mask:

Role-playing dimensions are a popular business requirement yet challenging to implement in Power BI (and Tabular) due to a long-standing limitation that two tables can’t be joined multiple times with active relationships. Declarative relationships are both a blessing and a curse and, in this case, we are confronted with their limitations. Had Power BI allowed multiple relationships, the user would have to be prompted which path to take. Interestingly, a long time ago Microsoft considered a user interface for the prompting but dropped the idea for unknown reasons.

Given the existing technology limitations, you have two implementation choices for implementing subsequent role-playing dimensions: duplicating the dimension table (either in DW or semantic model) or denormalizing the dimension fields into the fact table. The following table presents pros and cons of each option:

Click through for that table, as well as some thoughts on viable approaches, including an edge case.


Tips for Optimizing Power BI Semantic Models

Published 2024-10-15 by Kevin Feasel

Koen Verbeeck shares some tips:

Power BI is designed to be user-friendly. With just a few clicks, you can import data from various sources, combine them together in one data model and start analyzing it using powerful data visualizations. This sometimes leads to a scenario where people are just importing data into the tool without giving it too much thought. When you’re working on a solo project on a small dataset, there probably won’t be too many issues. But what if your report is successful and you want to share it with your colleagues and maybe other departments? Or more data is loaded into the model, but refreshes are taking more and more time? Even other data sources are added into your model, but writing DAX formulas has become hard, and reports are slowing down.

In this article, we’ll cover a couple of tricks that will help you make your Power BI models smaller, faster and easier to maintain. In the immortal words of Daft Punk: “Harder. Better. Faster. Stronger”.

Click through for those tricks and tips.


Microsoft Purview Classifications and Sensitivity Labels

Published 2024-07-25 by Kevin Feasel

James Serra labels the data:

I see a lot of confusion on how classifications and sensitivity labels work in Microsoft Purview. This blog will help to clear that up, but I first must address the confusion with Purview now that multiple products have been renamed to Microsoft Purview. I decided to use a question-and-answer format that will hopefully clear up the confusion (I was very confused too!):

Purview is a fantastic product. I just wish it cost about 10% as much as it does; then I could heartily recommend it to people.


Microsoft Fabric and Semantic Models

Published 2024-03-01 by Kevin Feasel

Kurt Buhler has a choose-your-own-adventure story:

Semantic models are integral to Microsoft Fabric. They use and are used by many of the different workloads. In Fabric, there are more items that can connect to and consume your model, such as semantic link in notebooks. Because of these new options and tools, your model is exposed to additional types of users who will use it in different ways. As such, it’s important that you make good models that you manage well throughout their entire lifecycle.

Read on for more information and three separate scenarios.


Using Schema Registry for Data Quality in Apache Kafka

Published 2024-01-05 by Kevin Feasel

Kai Waehner talks data quality:

Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

Click through to learn more about the topic. This focuses a lot on the “why” and “what” but does have an example of “how” in there as well.


The Value of Data Lineage

Published 2023-12-29 by Kevin Feasel

Chisom Kanu explains why data lineage matters:

Data lineage is a component of modern data management that helps organizations understand the origins, transformations, and movement of their data. It is like a road map that shows us where our data has been, how it has changed, and where it is going, just like tracking the journey of a package: from the person who sent it (the source) to the places it passes through, and finally to the person who receives it.

The concept of data lineage has been around for many years, but it has become increasingly important in recent years due to the growth of big data and the increasing complexity of data processing systems.

Read on to learn more about data lineage.


Building a Multi-Tenant Database

Published 2023-10-25 by Kevin Feasel

Adron Hall looks at multi-tenancy within Postgres:

Music has always been a significant part of my life. From the melodies that accompany my daily routines to the anthems of my most memorable moments, it’s been a constant. As my collection grew, I realized I needed a better way to organize it. That’s when I stumbled upon the concept of multi-tenancy databases and decided to give it a shot with PostgreSQL. Here’s my experience.

Multi-tenancy is one case in which I’m much more relaxed about including the tenant ID on tables where it is not absolutely necessary in order to prevent a series of joins to get the appropriate tenant ID. We can quibble about whether that’s reasonable denormalization or appropriate use of a superkey—especially because, in SQL Server, tenant ID ends up being part of the clustered index and likely part of the primary key anyhow—but it’s extremely useful nonetheless.


String Regularization and Tokenization in SQL Server

Published 2023-10-04 by Kevin Feasel

Aaron Bertrand saves some space:

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)

It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Click through for one technique Aaron has to tighten things up a bit.



Category: Data Modeling