
Category: Data Modeling

Degenerate Dimensions and Power BI DirectQuery

Chris Webb provides a warning:

A few weeks ago my colleague Dany Hoter wrote a post on the Azure Data Explorer blog about how using columns on fact tables as dimensions in DirectQuery mode can lead to errors in Power BI. You can read it here:

https://techcommunity.microsoft.com/t5/azure-data-explorer-blog/to-star-or-not-to-star-more-about-data-modeling-in-power-bi/ba-p/3689933

In the post he mentioned that he could reproduce the same behaviour in SQL Server, so I thought it would be good to show an example of this to raise awareness of the issue because I think it’s one that anyone using DirectQuery mode on any data source is likely to run into.

Read on to understand what might lead to (expected) errors and what you can do about it.


Bitemporal Modeling and Running Totals

John Mount solves a running total problem in Python:

An example of this is wanting to know how many reservations for a San Francisco Symphony concert scheduled for December 4th 2022 are known to have been made by October 22nd 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python, and a reminder of just how much we get to take for granted.
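As a quick sketch of that idea in T-SQL, assuming a hypothetical dbo.Reservations table with target_date, action_date, and reservation_count columns (the table name is mine, not John’s):

-- Running total of reservations known as of each action_date, per concert date.
SELECT
    r.target_date,
    r.action_date,
    SUM(r.reservation_count) OVER (
        PARTITION BY r.target_date
        ORDER BY r.action_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS reservations_known_so_far
FROM dbo.Reservations r;

Filtering that result to target_date = '2022-12-04' and taking the row with the latest action_date on or before '2022-10-22' would answer the “known as of October 22nd” question.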


Tuning a Range Query

Grant Fritchey gets forum-sniped:

Recently I was looking through DBA.StackExchange when I saw a pretty simple question that I decided to answer. I went off, set up a test database and some test tables and quickly wrote a query to answer the question. I got it all formatted pretty and was on my way to post it when I saw that another answer was already there.

Yeah. Identical to mine. Almost line for line.

Well, nuts.

I know. I’ll write a blog post.

In thinking about the problem, the thing that stuck in my mind was Grant’s comment about poor design. This got me thinking about one of my favorite topics: orthogonal design for relational excellence. The idea of a BETWEEN table of [ MinValue : MaxValue ] pairs is the first thing people think of, but it is also the worst, because you have two big problems: gaps and overlap.

The second solution is to use MinValue and calculate MaxValue (if we actually need it) as LEAD(MinValue) OVER (ORDER BY MinValue) - e, where e represents the smallest reasonable increment we’d need. Queries would find, for each Value in the main table, the largest MinValue below Value. That removes gaps and overlap but might be a performance concern as the main table’s data size grows.
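To make that concrete, here is a minimal sketch, with the table and column names (dbo.RankRange, dbo.MainTable, Value, MinValue, RankDesc) invented for illustration:

-- Derive MaxValue instead of storing it; e = 1 works if the values are integers.
SELECT
    r.MinValue,
    LEAD(r.MinValue) OVER (ORDER BY r.MinValue) - 1 AS MaxValue,
    r.RankDesc
FROM dbo.RankRange r;

-- For each Value, find the rank whose MinValue is the largest one at or below it.
SELECT
    m.Value,
    rr.RankDesc
FROM dbo.MainTable m
    CROSS APPLY
    (
        SELECT TOP (1) r.RankDesc
        FROM dbo.RankRange r
        WHERE r.MinValue <= m.Value
        ORDER BY r.MinValue DESC
    ) rr;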

The big-brain solution, which generally works best when you have a discrete number of options, could be a tally table. In Grant’s example, we see values from 1 to 1000, with a rank for each. If it’s really as simple as that, we would create a new lookup table with Value + RankDesc and simply join the main table’s Value to the lookup table’s Value to get the appropriate RankDesc. Yeah, you have 1000 rows instead of 3 but queries are trivial at that point. The downside is that this approach doesn’t work for continuous variables (e.g., give me the exact amount of your household income for the prior tax year) and the utility of this solution probably breaks down once you get past tens of thousands of rows.
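A rough sketch of that lookup approach, again with hypothetical object names:

-- One row per discrete Value, so the range logic disappears entirely.
CREATE TABLE dbo.ValueRank
(
    Value    int         NOT NULL PRIMARY KEY,
    RankDesc varchar(20) NOT NULL
);

-- Queries become a trivial equijoin.
SELECT
    m.Value,
    vr.RankDesc
FROM dbo.MainTable m
    INNER JOIN dbo.ValueRank vr
        ON vr.Value = m.Value;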

In the case of a continuous variable or an enormous discrete variable, we have the simplest option of all: ignore something. If you care about the range, use the table from the second solution and use that ID on the main table. If you care about the value but not the range, just have the value and no lookup table.


Have One Data Model per Business Area

James McGillivray offers us an important piece of advice:

I cannot stress this enough. If people are consuming your data in multiple places, the data needs to come from the same data model. That can be an Enterprise Data Warehouse, a Data Mart, a Power BI Model, or any other data source, but at some point you need to be able to track the data back to a single place. If you don’t do this, you will spend THE REST OF YOUR DAYS explaining the differences between the data models to business and customers, and reconciling the differences over and over again.

Read on to learn why this is so important.


Data Modeling with Spark–Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colors – displays the nickname, color, and age of the pets. We went from a normalized dataset, where common & recurring values were substituted for numeric representations, to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.
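To put that argument in concrete terms, here is a hypothetical sketch (table and column names are mine, not Landon’s): both versions below sit at the same level of normalization, and the only difference is whether color is stored as a surrogate key or as its natural value.

-- Version 1: pets referencing a color lookup table through a surrogate key.
CREATE TABLE dbo.Color
(
    ColorId   int         NOT NULL PRIMARY KEY,
    ColorName varchar(30) NOT NULL
);

CREATE TABLE dbo.Pet
(
    Nickname varchar(50) NOT NULL PRIMARY KEY,
    ColorId  int         NOT NULL REFERENCES dbo.Color (ColorId),
    Age      int         NOT NULL
);

-- Version 2: pets storing the color value directly.
CREATE TABLE dbo.PetDirect
(
    Nickname  varchar(50) NOT NULL PRIMARY KEY,
    ColorName varchar(30) NOT NULL,
    Age       int         NOT NULL
);

-- In both versions, every non-key attribute describes the pet identified by Nickname;
-- swapping surrogate keys for natural values changes neither the grain nor the normal form.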


Unicode and Data Length

Kevin Wilkie lays out an argument:

If you truly need the UNICODE characters in your data, go ahead and use them! If not, though, please make your DBA happy by not using them. Since UNICODE characters take up twice the amount of space that the ASCII versions do, your DBAs will recommend using the ASCII versions if you are not going to be using any UNICODE characters.

Read on for the justification. But I’m still in the NVARCHAR (Almost) Everywhere camp.
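If you want to see the size difference for yourself, here is a quick check, assuming a non-UTF-8 collation where varchar stores one byte per character for ASCII text:

-- DATALENGTH returns bytes, not characters.
SELECT
    DATALENGTH(CAST('Curated SQL' AS varchar(50)))   AS VarcharBytes,   -- 11
    DATALENGTH(CAST(N'Curated SQL' AS nvarchar(50))) AS NvarcharBytes;  -- 22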


Database Schema Types

Steve Jones explains schema types:

OLTP/Relational

The type of schema that many of us work with is the standard OLTP or relational model. We have lots of transaction tables, most of which should have a PK (and some of which actually do). The schema expands to meet different needs and can have lots of entities.

It may just be the time of morning, but “Galaxy schema” sounds dumb specifically because the Kimball style of star schema implicitly includes what the galaxy schema shows. Dimensions are conformed, which means they apply across facts, which implies that there may be multiple fact tables in the schema design. This means that galaxy schemas are necessarily star schemas. For the sake of education, we tend to focus on one fact table, but a star schema with two fact tables is still a star schema.
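As a quick illustration of the point, with hypothetical table names, here is a design with two fact tables sharing conformed dimensions. It is exactly what the “galaxy” label describes, and it is still just a star schema:

-- Conformed dimensions shared by both fact tables.
CREATE TABLE dbo.DimDate     (DateKey     int NOT NULL PRIMARY KEY /* , ... */);
CREATE TABLE dbo.DimCustomer (CustomerKey int NOT NULL PRIMARY KEY /* , ... */);

-- Two fact tables, each forming its own star around the same dimensions.
CREATE TABLE dbo.FactSales
(
    DateKey     int            NOT NULL REFERENCES dbo.DimDate (DateKey),
    CustomerKey int            NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    SalesAmount decimal(18, 2) NOT NULL
);

CREATE TABLE dbo.FactReturns
(
    DateKey      int            NOT NULL REFERENCES dbo.DimDate (DateKey),
    CustomerKey  int            NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    ReturnAmount decimal(18, 2) NOT NULL
);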

Anyhow, that’s my minor rant of the day. It’s not Steve’s fault that somebody misunderstood the concept of star schemas and began promulgating this unnecessary term.


Swap an Identity Column with a Sequence

Greg Larsen looks at sequences:

Microsoft introduced the sequence number objects starting with SQL Server 2012. A sequence object generates sequence numbers based on starting and increment values, similar to an identity column, but it has additional features. Over time, you might find that the additional benefits of a sequence number have you wanting to replace an identity column with a sequence number. This article demonstrates two options for replacing an identity column with a sequence number.

When it came out, I thought I was going to use sequences quite often. In practice, I’ve used them a few times, but IDENTITY is so much easier to type and I’ve rarely needed sequence generators. That said, the times I have needed them, I’ve definitely appreciated their existence.
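For reference, a minimal sketch of the sequence-based approach, with object names invented for illustration:

-- A sequence object generates values independently of any one table.
CREATE SEQUENCE dbo.OrderNumberSeq
    AS int
    START WITH 1
    INCREMENT BY 1;

-- Use it as a default in place of the IDENTITY property.
CREATE TABLE dbo.Orders
(
    OrderID   int  NOT NULL
        CONSTRAINT DF_Orders_OrderID DEFAULT (NEXT VALUE FOR dbo.OrderNumberSeq)
        CONSTRAINT PK_Orders PRIMARY KEY,
    OrderDate date NOT NULL
);

-- Unlike IDENTITY, you can also grab a value outside of an insert.
SELECT NEXT VALUE FOR dbo.OrderNumberSeq AS NextOrderID;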


The Importance of Data Dictionaries

John Morehouse takes us through data dictionaries:

Data professionals—whether they’re database administrators (DBAs), developers, or data scientists—work in a wide and varied landscape usually in flux and filled with challenges. These challenges could range from changing business requirements to keeping up with the sheer velocity at which technology evolves.

It’s also critical for these professionals to understand their organization’s data and how it applies to a given application or business unit. Better outcomes usually come from employing data dictionaries throughout the organization. Through many years of experience in IT professions, I’ve seen the utilization of data dictionaries range from “not at all” to “I’m documenting every possible data attribute known to humankind.” In my experience, data-related projects with data dictionaries as part of the process are far more likely to be successful than projects without them (even extremely populated data dictionaries are more useful than nothing at all). Trust me on this.

Click through to understand why you should trust John on this one.


Sixth Normal Form to Avoid NULLs

I have a response to an article:

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

One explicit assumption in here is that you don’t really have a large number of nullable (or NULLable, as long-form blogging me wants to write) columns on a given table. Using 6NF-style tables for nullable attributes is a lot less tenable when you have 15 or 20 distinct nullable columns on a table, but at that point I have to ask: is your data model actually correct if you have that many missing attributes?
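For anyone who hasn’t seen the pattern, here is a hedged sketch of a 6NF-style attribute table versus a nullable column, with illustrative names of my own choosing:

-- Nullable-column approach: the optional attribute lives on the main table.
CREATE TABLE dbo.Employee
(
    EmployeeID      int          NOT NULL PRIMARY KEY,
    EmployeeName    varchar(100) NOT NULL,
    TerminationDate date         NULL  -- NULL stands in for 'still employed' or 'unknown'
);

-- 6NF-style approach: one narrow table per optional attribute, with no NULLs anywhere.
CREATE TABLE dbo.EmployeeCore
(
    EmployeeID   int          NOT NULL PRIMARY KEY,
    EmployeeName varchar(100) NOT NULL
);

CREATE TABLE dbo.EmployeeTermination
(
    EmployeeID      int  NOT NULL PRIMARY KEY
        REFERENCES dbo.EmployeeCore (EmployeeID),
    TerminationDate date NOT NULL  -- a row exists only when the fact is known
);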
