Press "Enter" to skip to content

Category: Data Modeling

Create and Update Columns on Tables

Kenneth Fisher adds four columns to a table:

One of the easiest ways to collect information about table activity is to add a series of audit columns to the table. The most common set consists of four columns:

– When was the row created?

– Who created it?

– When was the row last updated?

– Who last updated it?

Read on to see how to create these, but definitely read the comments. Joe Celko has a good reason to avoid this style, and there’s another good reason as well: if you update the columns and the updated values are longer than what they replace, you can end up generating a lot of page splits as full pages get broken up to handle those in-page updates.

If you do need something like this, I’m growing more and more fond of an activity log table which is append-only and tracks these operations separately. That removes the page splits, allows you to deal with deletions, and gives a better idea of what happened over time for a third party. It does come at a cost if you frequently need to show the create and last updated dates (or users), but my experience has been that a vast majority of the time, we include those for internal purposes, not to display to users.
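
To make that concrete, here is a minimal sketch of what such an append-only activity log might look like in T-SQL. The table and column names are illustrative only, not from Kenneth's post:

-- Hypothetical append-only activity log; names are illustrative only.
CREATE TABLE dbo.TableActivityLog
(
    ActivityLogID   bigint IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_TableActivityLog PRIMARY KEY CLUSTERED,
    TableName       sysname NOT NULL,        -- which table was touched
    RowKey          nvarchar(128) NOT NULL,  -- key of the affected row
    ActivityType    char(1) NOT NULL         -- I = insert, U = update, D = delete
        CONSTRAINT CK_TableActivityLog_Type CHECK (ActivityType IN ('I', 'U', 'D')),
    ActivityUser    sysname NOT NULL
        CONSTRAINT DF_TableActivityLog_User DEFAULT (ORIGINAL_LOGIN()),
    ActivityTimeUTC datetime2(3) NOT NULL
        CONSTRAINT DF_TableActivityLog_Time DEFAULT (SYSUTCDATETIME())
);

Because rows only ever land at the end of the clustered index and existing rows never change, you avoid the in-place updates (and resulting page splits) that the audit-column approach can cause.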


Type 1 SCDs in Delta Lake

Chris Williams starts a series on slowly changing dimensions in a Delta Lake:

Anyone who has contributed towards a Data Warehouse or a dimensional model in Power BI will know the distinction made between the time-series metrics of a Fact Table and the categorised attributes of a Dimension Table. These dimensions are also affected by the passage of time and require revised descriptions periodically, which is why they are known as Slowly Changing Dimensions (SCDs). See The Data Warehouse Toolkit – Kimball & Ross for more information.

Here is where the Delta Lake comes in. Using its many features, such as support for ACID transactions (Atomicity, Consistency, Isolation and Durability) and schema enforcement, we can create the same durable SCDs. In the past, this may have required a series of complicated SQL statements. I will now discuss a few of the most common SCDs and show how they can be easily achieved using a few Databricks Notebooks, which are available from my GitHub repo so you can download them and have a go:

https://github.com/cwilliams87/Blog-SCDs

Check out the repo, but be sure to read the whole post.
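
As a taste of what Type 1 handling looks like on Delta Lake, here is a minimal sketch using MERGE in Spark SQL. The table and column names are hypothetical and not taken from Chris's notebooks:

-- Hypothetical Type 1 merge: overwrite changed attributes in place and
-- insert rows that do not exist yet. Delta Lake makes the whole merge atomic.
MERGE INTO dim_customer AS tgt
USING stg_customer AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET
        customer_name = src.customer_name,
        city          = src.city
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, city)
    VALUES (src.customer_id, src.customer_name, src.city);

Because Delta handles the transaction, a failed merge leaves the dimension untouched rather than half-updated.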


Compressing JSON in SQL Server

Randolph West has a recommendation:

I’ll also pre-emptively note that if this table was simply an append-only archive table, the row size would not really matter. Unfortunately, this table participates in thousands of transactions per day, and as the original developers used Entity Framework and didn’t think much of using NVARCHAR(MAX), the entire row is coming over the wire into the application each time it is queried.

As I’ve written previously about this kind of thing, this is not a good design pattern. Using the VARBINARY(MAX) data type with COMPRESS in the INSERT/UPDATE queries — and DECOMPRESS in the SELECT queries — is a much better design pattern and dramatically reduces the amount of data transferred over the network. Additionally, SQL Server needs significantly less space to store, maintain, and back up this compressed data.

Read on to see the likely benefits from doing this. I’d say that if your main purpose of storing the JSON is just to pass a blob back and forth, then yes, do compress. If you’re frequently shredding these sorts of large documents within SQL Server…well, probably time for a better data model.
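
If you have not seen the pattern before, here is a rough sketch of what it looks like; the table and column names are my own, not Randolph's:

-- Hypothetical table storing GZip-compressed JSON as VARBINARY(MAX).
CREATE TABLE dbo.OrderPayload
(
    OrderID     int            NOT NULL PRIMARY KEY,
    PayloadJson varbinary(max) NOT NULL
);

-- Compress on the way in...
INSERT INTO dbo.OrderPayload (OrderID, PayloadJson)
VALUES (1, COMPRESS(N'{"orderId": 1, "lines": []}'));

-- ...and decompress (plus cast back to NVARCHAR) on the way out.
SELECT o.OrderID,
       CAST(DECOMPRESS(o.PayloadJson) AS nvarchar(max)) AS PayloadJson
FROM dbo.OrderPayload AS o;

COMPRESS and DECOMPRESS use GZip under the covers and have been available since SQL Server 2016.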


Tools for Data Modeling

Rayis Imayev shares information on a few database modeling tools:

In the same way that a picture is worth a thousand words, a visual database schema is better than a database model communicated through a multitude of scripted text objects.

Someone may spend a significant amount of time trying to describe all the database tables' attributes, constraints and relationships between tables with words, while a visual Entity Relationship Diagram (ERD) may only take a few minutes (or seconds) to tell the same story.

There are several tools, different yet similar, available to help you create these visual ERD artifacts. The more visual a tool is, the less effort is required to add new information by typing. The visual drag-and-drop experience prevails; your keyboard-typing skills are only necessary for entering the explicit list of attributes, types, and other details of your data model entities.

Click through for reviews of several tools, ranging from free to expensive.


Comparing Views to Derived Tables and CTEs

Itzik Ben-Gan takes us through views:

As usual when discussing relational theory, we SQL practitioners are often told that the terminology we’re using is wrong. So, in this spirit, right off the bat, I’ll start by saying that when you use the term tables and views, it’s wrong. I’ve learned this from Chris Date.

Recall that a table is SQL’s counterpart to a relation (oversimplifying the discussion around values and variables a bit). A table could be a base table defined as an object in the database, or it could be a table returned by an expression—more specifically, a table expression. That’s similar to the fact that a relation could be one that is returned from a relational expression. A table expression could be a query.

Now, what is a view? It’s a named table expression, much like a CTE is a named table expression. It’s just that, like I said, a view is a reusable named table expression that is created as an object in the database and is accessible to those who have the right permissions. This is all to say, a view is a table. It’s not a base table, but a table nonetheless. So just like saying “a rectangle and a square” or “a whisky and a Lagavulin” would seem strange (unless you had too much Lagavulin!), using “tables and views” is just as improper.

Yeah, if we’re going to push our glasses up the bridges of our noses and get all relational here, we’d have the relvar (which is a name and collection of attributes + data types) and the relation (which you can think of as a relvar at a particular point in time—that’s where we get tuples of data). And as Itzik points out, what the RDBMS world calls a view quite neatly fits the definition of a relvar, as we define a name and collection of attributes + data types.

If all of this could not mean less to you, still read the article for Itzik’s view on views.
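
To make the “named table expression” point concrete, here is a small sketch of the same expression written as a view and as a CTE; the Sales.Orders table is hypothetical:

-- As a view: a named table expression stored as a database object.
CREATE VIEW Sales.RecentOrders
AS
SELECT o.OrderID, o.CustomerID, o.OrderDate
FROM Sales.Orders AS o
WHERE o.OrderDate >= DATEADD(YEAR, -1, CAST(GETDATE() AS date));
GO

-- As a CTE: the same named table expression, scoped to one statement.
WITH RecentOrders AS
(
    SELECT o.OrderID, o.CustomerID, o.OrderDate
    FROM Sales.Orders AS o
    WHERE o.OrderDate >= DATEADD(YEAR, -1, CAST(GETDATE() AS date))
)
SELECT r.CustomerID, COUNT(*) AS OrderCount
FROM RecentOrders AS r
GROUP BY r.CustomerID;

The body of the expression is identical in both cases; the difference is scope and reusability, which is Itzik's point.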


Empty Strings vs NULL

Erik Darling lays out a challenge:

Empty Strings Aren’t Better Than NULLs

Prove me wrong.

Click through for commenters’ perspectives.

I’m not at the C.J. Date level of “NULLs are a mockery of the entire relational system and should be burned to the ground and that ground be salted and that salt be burned, just in case” but I do have my sympathies there.

An empty string has meaning: the value of a given attribute in this tuple is an empty string. NULL has no value and therefore violates first normal form (which, unlike my joking histrionics above, is an argument Date lays out in far too much detail for me to include here).

So what if you don’t know the value of a thing? Then the answer is, don’t store it! If there are certain attributes which may be missing at insertion time and are nonetheless relevant to maintain, use 6th Normal Form for each of those attributes. Then, the existence of a record indicates that it has a value and the lack of existence indicates that there is no value as of this time. This is quite different from NULL, about which we can only say “I don’t know if there is a value at this time, but if you’d like, I can do weird things with some of your queries if you happen to forget about this non-information.”
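
For the curious, here is a minimal sketch of that 6NF-style split, with a purely illustrative optional attribute:

-- Hypothetical 6NF-style design: the optional attribute lives in its own
-- table, so "unknown" is the absence of a row rather than a NULL.
CREATE TABLE dbo.Person
(
    PersonID   int           NOT NULL PRIMARY KEY,
    PersonName nvarchar(100) NOT NULL
);

CREATE TABLE dbo.PersonMiddleName
(
    PersonID   int           NOT NULL PRIMARY KEY
        REFERENCES dbo.Person (PersonID),
    MiddleName nvarchar(100) NOT NULL   -- no NULLs anywhere in the design
);

-- "Which people have a known middle name?" becomes an existence check.
SELECT p.PersonID, p.PersonName, m.MiddleName
FROM dbo.Person AS p
INNER JOIN dbo.PersonMiddleName AS m
    ON m.PersonID = p.PersonID;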


Designing and Managing Large Datasets in Power BI

Paul Turley continues a series on doing Power BI the right way:

I was just talking to a consulting client about the best approach to build a data model and he told me something very interesting about the way they were loading data into Power BI. He said “We don’t use facts and dimensions, we load all of our data into one huge table.” He said that their data model performs well and that it meets their reporting needs. It is a difficult point to argue when something is working at the time, even though the design might not follow the accepted rules. Life is like that, and there are plenty of analogies to make the point that a practice, even a real bad practice, might solve a problem for a period of time and under certain conditions. You can drive a car at excessive speed to get to your destination faster. You might not get caught by the police on that day and you might not crash, but eventually, if you make it a habit, this practice will catch up to you. Data is like that. If you don’t play by the rules, you limit your options. Bending the rules lets you move faster and sometimes with less hassle. But as the project scope expands, and after adding enough data or other complexities to the solution, it will not endure. The data model won’t perform well, won’t load the correct data, or it just won’t be reliable.

This post will explore the realities of best practice design for large data models: some important considerations and trade-off decisions when working with both “big data” and “large data”.

Read on for Paul’s tips.
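
For contrast with the one-huge-table approach, here is a rough, hypothetical sketch of the dimensional shape Paul is advocating, expressed as source tables that Power BI would import and relate in the model:

-- Hypothetical star schema: measures in a fact table, descriptive
-- attributes in dimension tables, instead of one wide flattened table.
CREATE TABLE dbo.DimDate
(
    DateKey      int           NOT NULL PRIMARY KEY,  -- e.g. 20210219
    CalendarDate date          NOT NULL,
    CalendarYear int           NOT NULL,
    MonthName    nvarchar(20)  NOT NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductKey   int           NOT NULL PRIMARY KEY,
    ProductName  nvarchar(100) NOT NULL,
    Category     nvarchar(50)  NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey      int            NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey   int            NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity     int            NOT NULL,
    SalesAmount  decimal(18, 2) NOT NULL
);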


Data Modeling Essentials in Power BI

Paul Turley continues a series on doing Power BI the right way:

One of the most important lessons I have learned about data modeling over the past 20+ years is that there isn’t one model that fits all business needs. However, there are simple patterns we can follow to model data according to different business needs. Each pattern or schema has a different set of rules. At a very high level, we can think of these schemas in three different categories.

This is the 101 level course, but it’s good to get a refresher on the fundamentals before jumping into the complicated part.


Azure Purview: External Connections and the Starter Kit

Wolfgang Strasser continues a series on Azure Purview. First up is a look at external connections:

During my tests with Azure Purview, I found out that in my demo accounts (I created multiple to test it), the Management menu was missing some items.

Read on to learn why and how you can rectify this. Then, check out Wolfgang’s take on the Starter Kit:

Very often, when I want to test some new services, I am missing the infrastructure and environments I can start up and play with. I am not talking about creating a new Azure Purview account (see my previous blog post – Creating an Azure Purview account) – I am talking about the data infrastructure to analyze, catalog and gain knowledge out of it.

We could start to create such an infrastructure ourselves, BUT: the Purview team created a Starter Kit to quickly build a data estate and configure everything so that you can start with Purview within a few minutes.

Read on for one issue (of the self-inflicted variety) Wolfgang ran into during deployment. But it does look like a great way to get started with Purview and build up a relevant demo environment.


Power BI the Right Way: Separating Data Models and Reports

Paul Turley continues a series on doing Power BI the right way:

Back in the day, when we created BI solutions, reports and data models were separate. If you created a cube or Tabular model with Analysis Services, it was developed with Visual Studio and deployed to a server. Reports could be authored and deployed to the report server, separately. Now, with Power BI Desktop, you author your data model and report in the same development space and then deploy the whole kit and kaboodle to the service at once.

Secretly, Power BI actually separates the report from the data model (called a dataset in the service) and gives them both the same name. This is very convenient for self-service projects because it is quick and easy to make changes to the data model, queries and measures if you need to make a report enhancement. This is all well and good for small, one-developer projects, but what about those larger-scale solutions where the data model and reports are developed by different folks, or when multiple reports are connected to the same data model?

At what point does it make sense to separate the data model and reports into separate files?

Read on and let Paul illuminate.
