Category: Normalization

Two Takes on First Normal Form

Published 2023-10-11 by Kevin Feasel

Joe Celko defends the honor of First Normal Form:

You do not need a complete understanding of regular expressions or ICD codes to follow this article, so don’t worry too much about it. The reason for posting the simplified regular expression was to scare you. My point was that this regular expression would be a pretty impressive CHECK constraint on this column. Shall we be honest? Despite the fact that we know the best programming practice is to detect an error as soon as possible, do you believe that the original poster wrote such a constraint for the concatenated list of ICD codes?

I’m willing to bet that any such validation is being done in an input tier by some poor lonely program, in an application language. Even more likely, it’s not being done at all.

First Normal Form (1NF) says that this concatenated string is a repeated group, and we need to replace it with a proper relational construct.

In the meantime, I’ve continued my series on database normalization and call First Normal Form overrated:

In this video, we start at the ground floor with 1st Normal Form. We’ll learn what people think it is, what it really is, and why it’s not as great as it’s cracked up to be.

I agree with Joe that his ICD-10 code example is a bad database design. The area in which I don’t agree—and for this, I’m leaning heavily on C.J. Date—is that repeating groups actually violate 1NF. My video covers this in a bit more detail and I also include a quotation from Date’s recent book on database design talking about how 1NF has nothing to do with repeating groups or atomicity, and that 1NF could even include relvars inside of relvars (an example Joe shows 1NF preventing).
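
Whatever normal form you pin it on, the design fix for a concatenated code list looks about the same. Here is a minimal sketch, assuming SQLite through Python's built-in sqlite3 module. The table name, column names, and the `store_codes` helper are hypothetical, and the GLOB pattern is a deliberately crude stand-in for real ICD-10 validation, not the regular expression from Joe's article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (visit, code) instead of one concatenated string per visit.
# The CHECK is a simplified placeholder: a capital letter, two digits, then anything.
conn.execute("""
    CREATE TABLE visit_diagnosis (
        visit_id INTEGER NOT NULL,
        icd_code TEXT NOT NULL
            CHECK (icd_code GLOB '[A-Z][0-9][0-9]*'),
        PRIMARY KEY (visit_id, icd_code)
    )
""")

def store_codes(visit_id, concatenated):
    """Split a comma-separated list and insert one row per code (hypothetical helper)."""
    codes = [c.strip() for c in concatenated.split(",") if c.strip()]
    with conn:  # commits on success, rolls back the whole batch on error
        conn.executemany(
            "INSERT INTO visit_diagnosis (visit_id, icd_code) VALUES (?, ?)",
            [(visit_id, code) for code in codes],
        )

store_codes(1, "E11.9, I10, J45.909")

# A malformed entry is rejected by the database itself, not by an input tier.
try:
    store_codes(2, "E11.9, bogus")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The point of the sketch is that a per-row constraint can be small and readable, whereas validating the whole concatenated string takes the kind of scary pattern Joe describes.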

Data Modeling with Spark–Breaking Data into Multiple Tables

Published 2022-06-02 by Kevin Feasel

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colors – displays the nickname, color and age of the pets. We went from a normalized dataset where common & recurring values were substituted for numeric representations – to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.
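
To illustrate the key-swapping point, here is a small sketch using SQLite via Python's sqlite3 module rather than Spark DataFrames. The table names and sample pets are made up for illustration and are not Landon's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table: surrogate key (color_id) standing in for the natural key (color).
    CREATE TABLE colors (
        color_id INTEGER PRIMARY KEY,
        color    TEXT NOT NULL UNIQUE
    );

    -- Version that stores the surrogate key.
    CREATE TABLE pets_surrogate (
        nickname TEXT PRIMARY KEY,
        color_id INTEGER NOT NULL REFERENCES colors (color_id),
        age      INTEGER NOT NULL
    );

    -- Version that stores the natural key directly.
    CREATE TABLE pets_natural (
        nickname TEXT PRIMARY KEY,
        color    TEXT NOT NULL,
        age      INTEGER NOT NULL
    );

    INSERT INTO colors VALUES (1, 'brown'), (2, 'black');
    INSERT INTO pets_surrogate VALUES ('Rex', 1, 4), ('Luna', 2, 2);
    INSERT INTO pets_natural VALUES ('Rex', 'brown', 4), ('Luna', 'black', 2);
""")

# Translating the surrogate key back to the natural key...
joined = conn.execute("""
    SELECT p.nickname, c.color, p.age
    FROM pets_surrogate AS p
    JOIN colors AS c ON c.color_id = p.color_id
    ORDER BY p.nickname
""").fetchall()

# ...yields exactly the rows of the table that stored the natural key all along.
natural = conn.execute(
    "SELECT nickname, color, age FROM pets_natural ORDER BY nickname"
).fetchall()
```

In both versions every non-key attribute depends on the pet's nickname and nothing else, so the two pet tables sit at the same normal form. The lookup table changes how the color is encoded, not which facts live together.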

Saving Space with 6NF in SQL Server

Published 2022-04-11 by Kevin Feasel

Aaron Bertrand has a two-parter. Part one sets up the problem:

We often build logging or other insert-only tables where we store large strings like URLs, host names, or error messages. It’s usually not until the table has terabytes of data that we realize there might have been a better way. If we are logging traffic or exceptions for our own application, it’s likely that we record the same URL, host name, or error message on millions of rows. What if we only had to write that URL or host name or message text once, the first time we saw it? In this tip, I want to share one idea for abstracting away recurring values, reducing storage, and making search queries faster (especially those with wildcards) without requiring immediate changes in the application layer.

Part two maximizes the savings:

In my previous tip, I showed how we can make a growing logging table leaner by moving large, repeating strings to their own dimension tables. The solution there involved an AFTER INSERT trigger and assumed that we could change the applications to recognize the new table structure in relatively short order.

Check out both posts for more details. If you’re confused about my calling this 6NF and Aaron mentioning dimension tables, the answer is that he’s talking about the end result and I’m describing the process.
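
The core idea of writing each repeated string once can be sketched in a few lines. This uses SQLite via Python's sqlite3 module instead of SQL Server, and it does the lookup in a hypothetical `log_request` helper rather than in the AFTER INSERT trigger Aaron describes, so treat it as an illustration of the shape of the result and not as his solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each distinct URL is stored exactly once.
    CREATE TABLE urls (
        url_id INTEGER PRIMARY KEY,
        url    TEXT NOT NULL UNIQUE
    );

    -- The high-volume log table carries only the small integer key.
    CREATE TABLE request_log (
        log_id    INTEGER PRIMARY KEY,
        url_id    INTEGER NOT NULL REFERENCES urls (url_id),
        logged_at TEXT NOT NULL
    );
""")

def log_request(url, logged_at):
    """Record one request, adding the URL to the lookup table on first sight."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        (url_id,) = conn.execute(
            "SELECT url_id FROM urls WHERE url = ?", (url,)
        ).fetchone()
        conn.execute(
            "INSERT INTO request_log (url_id, logged_at) VALUES (?, ?)",
            (url_id, logged_at),
        )

log_request("https://example.com/cart", "2022-04-11T09:00:00")
log_request("https://example.com/cart", "2022-04-11T09:00:05")
log_request("https://example.com/home", "2022-04-11T09:00:07")

# A wildcard search only has to scan the distinct URLs, then join to the log.
cart_hits = conn.execute("""
    SELECT COUNT(*)
    FROM request_log AS r
    JOIN urls AS u ON u.url_id = r.url_id
    WHERE u.url LIKE '%/cart%'
""").fetchone()[0]
```

Three log rows end up referencing two stored URLs, which is where the space savings come from once the row count reaches the millions.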

Abnormal Tables and Skewed Data

Published 2021-10-29 by Kevin Feasel

Erik Darling reminds us to be vigilant in database design:

But the Posts table suffers from a serious design flaw in the public data dump: Questions and Answers are in the same table.
I’ve heard that it’s worse behind the scenes, but I don’t have any additional details on that.

Read on to understand why this is a problem and what the ramifications are.

Sixth Normal Form to Avoid NULLs

Published 2021-10-26 by Kevin Feasel

I have a response to an article:

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

One explicit assumption in here is that you don’t really have a large number of nullable (or NULLable, as long-form blogging me wants to write) columns on a given table. 6NF-style tables for nullable attributes are a lot less tenable when you have 15 or 20 distinct nullable columns on a table, but at that point I have to ask, is your data model actually correct if you have that many missing attributes?
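
For a rough picture of what a 6NF-style table for an optional attribute looks like, here is a sketch using SQLite via Python's sqlite3 module. The `person` and `person_nickname` tables and their contents are hypothetical examples, not taken from either post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Attributes every person has: no nullable columns here.
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL
    );

    -- The optional attribute gets its own table: key plus one attribute.
    -- A missing nickname is a missing row, not a NULL.
    CREATE TABLE person_nickname (
        person_id INTEGER PRIMARY KEY REFERENCES person (person_id),
        nickname  TEXT NOT NULL
    );

    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO person_nickname VALUES (1, 'Annie');
""")

# Inner join: only people for whom the fact is actually known.
with_nickname = conn.execute("""
    SELECT p.full_name, n.nickname
    FROM person AS p
    JOIN person_nickname AS n ON n.person_id = p.person_id
""").fetchall()

# Outer join: NULL shows up only in the query result, not in stored data.
everyone = conn.execute("""
    SELECT p.full_name, n.nickname
    FROM person AS p
    LEFT JOIN person_nickname AS n ON n.person_id = p.person_id
    ORDER BY p.person_id
""").fetchall()
```

One extra table per optional attribute is manageable for a handful of attributes. At 15 or 20 of them, the join count is what makes the approach less tenable.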

Indicators of Schema Issues

Published 2021-02-04 by Kevin Feasel

Erik Darling has a good list of schema-related issues:

Something is broken in the way that you store data.
You’re overloading things, and you’re going to hit big performance problems when your database grows past puberty.

Most of what he’s describing in this post is a failure of atomicity, which implies a failure to achieve first normal form. Mind you, all of these functions are perfectly reasonable as part of data loading, and many of them are perfectly reasonable in the SELECT clause of a query (though that’s still a sign of failure of atomicity), but once you start throwing them into the WHERE clause, we’ve got problems.
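
Here is a small sketch of why the WHERE clause is where it hurts, using SQLite via Python's sqlite3 module. The tables and the `substr` call are my own hypothetical stand-ins, not examples from Erik's post, and the plan text comes from SQLite's EXPLAIN QUERY PLAN, whose exact wording can vary between versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Overloaded column: region and SKU packed into one string.
    CREATE TABLE parts_overloaded (part_id INTEGER PRIMARY KEY, code TEXT NOT NULL);
    CREATE INDEX ix_overloaded_code ON parts_overloaded (code);

    -- Atomic columns: each attribute stored on its own.
    CREATE TABLE parts_atomic (
        part_id INTEGER PRIMARY KEY,
        region  TEXT NOT NULL,
        sku     TEXT NOT NULL
    );
    CREATE INDEX ix_atomic_sku ON parts_atomic (sku);

    INSERT INTO parts_overloaded VALUES (1, 'EU-1001'), (2, 'US-1001'), (3, 'EU-2002');
    INSERT INTO parts_atomic VALUES (1, 'EU', '1001'), (2, 'US', '1001'), (3, 'EU', '2002');
""")

overloaded_sql = "SELECT part_id FROM parts_overloaded WHERE substr(code, 4) = '1001'"
atomic_sql = "SELECT part_id FROM parts_atomic WHERE sku = '1001'"

# Both queries answer the same question.
overloaded_rows = sorted(conn.execute(overloaded_sql).fetchall())
atomic_rows = sorted(conn.execute(atomic_sql).fetchall())

def plan(sql):
    """Return the detail text of SQLite's query plan (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)

# The function call hides the value from the index, so every row gets examined;
# the atomic column can be looked up through its index directly.
overloaded_plan = plan(overloaded_sql)
atomic_plan = plan(atomic_sql)
```

Printing the two plans should show a scan for the overloaded table and an index search for the atomic one, even though an index exists on both.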

Normalization and Reduced Blocking

Published 2021-01-20 by Kevin Feasel

Erik Darling points out one of the many benefits of normalizing tables in a database:

Looking at the design, there are two big problems:
1. There are “order” columns that are going to get a lot of inserts and updates
2. You’re going to be storing the same customer information over and over again
The more related, but not independent, data you store in the same table, the harder it becomes to effectively index that table.

My take on this is that the old adage of “Normalize until it hurts; denormalize until it works” hasn’t been operative for the past 15 years, ever since the SSD era began.
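
As a rough illustration of the second problem in Erik's list, the sketch below counts how many rows one customer change has to touch in each design. It uses SQLite via Python's sqlite3 module, all table and column names are hypothetical, and SQLite's locking model differs from SQL Server's, so read the row counts as a proxy for how much data a writer has to modify and not as a blocking benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Customer details repeated on every order row.
    CREATE TABLE orders_wide (
        order_id       INTEGER PRIMARY KEY,
        customer_name  TEXT NOT NULL,
        customer_email TEXT NOT NULL,
        status         TEXT NOT NULL
    );

    -- Customer details stored once; orders point at them.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        status      TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")
for order_id in range(1, 6):
    conn.execute(
        "INSERT INTO orders_wide VALUES (?, 'Ada', 'ada@example.com', 'open')",
        (order_id,),
    )
    conn.execute("INSERT INTO orders VALUES (?, 1, 'open')", (order_id,))

# Changing one customer's email in the wide table rewrites every order row...
wide_rows_touched = conn.execute(
    "UPDATE orders_wide SET customer_email = 'ada@example.org' "
    "WHERE customer_name = 'Ada'"
).rowcount

# ...while the normalized design rewrites a single customer row.
normalized_rows_touched = conn.execute(
    "UPDATE customers SET email = 'ada@example.org' WHERE customer_id = 1"
).rowcount
```

With five orders, the wide table touches five rows and the normalized design touches one, and that gap grows with every order the customer places.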

Database Normal Forms

Published 2020-11-23 by Kevin Feasel

Joe Celko walks us through key and less-key normal forms:

Even before RDBMS, we had network and hierarchical databases. Their first goal was to remove redundancy. We want to store one fact, one way, one place, and one time. Normalization goes a step further. The goal of normalization is to prevent anomalies in the data. An anomaly could be an insertion anomaly, update anomaly, or deletion anomaly. This means that doing one of those basic operations destroys a fact or creates a falsehood.

It’s an interesting read on a sadly-neglected topic.
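
To make "destroys a fact or creates a falsehood" concrete, here is a sketch of an update anomaly and a deletion anomaly, using SQLite via Python's sqlite3 module. The `enrollment` table and its rows are a hypothetical textbook-style example, not taken from Joe's article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Instructor depends on course alone, yet it is stored per (student, course).
    CREATE TABLE enrollment (
        student    TEXT NOT NULL,
        course     TEXT NOT NULL,
        instructor TEXT NOT NULL,
        PRIMARY KEY (student, course)
    );
    INSERT INTO enrollment VALUES
        ('Ann', 'Databases',  'Dr. Lee'),
        ('Bob', 'Databases',  'Dr. Lee'),
        ('Ann', 'Statistics', 'Dr. Cho');
""")

# Update anomaly: changing the instructor on one row creates a falsehood,
# because the course now appears to have two different instructors.
conn.execute(
    "UPDATE enrollment SET instructor = 'Dr. Kim' "
    "WHERE student = 'Ann' AND course = 'Databases'"
)
database_instructors = conn.execute(
    "SELECT DISTINCT instructor FROM enrollment "
    "WHERE course = 'Databases' ORDER BY instructor"
).fetchall()

# Deletion anomaly: removing the only Statistics enrollment destroys the
# unrelated fact of who teaches Statistics.
conn.execute(
    "DELETE FROM enrollment WHERE student = 'Ann' AND course = 'Statistics'"
)
statistics_instructor = conn.execute(
    "SELECT instructor FROM enrollment WHERE course = 'Statistics'"
).fetchall()
```

Moving the course-to-instructor fact into its own table keyed on course would rule out both outcomes, since each fact would then be stored in exactly one row.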

Data Type Conversions in Predicates

Published 2019-05-22 by Kevin Feasel

Bert Wagner takes us through a troublesome table design:

This table stores data for an application that has many different types of Pages. Each Page stores different types of data, but instead of creating a separate table for each type, we store all the different data in the varchar DataValue column and maintain the original data type in the DataType column.
This structure reduces the complexity required for maintaining our database (compared to creating possibly hundreds of tables, one for each PageName) and makes querying easier (only need to query one table). However, this design could also lead to some unexpected query results.

This is your daily reminder that an attribute should be a thing which describes an entity, not one of multiple things.
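
Here is one flavor of "unexpected query results" from storing numbers as strings, sketched with SQLite via Python's sqlite3 module. Bert's post deals with SQL Server, where the conversion behavior differs, so this is an analogous illustration with hypothetical table and column names and not a reproduction of his example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One text column holds values of many logical types.
    CREATE TABLE page_data (
        page_name  TEXT NOT NULL,
        data_type  TEXT NOT NULL,
        data_value TEXT NOT NULL
    );
    INSERT INTO page_data VALUES
        ('Inventory', 'int', '9'),
        ('Inventory', 'int', '10'),
        ('Inventory', 'int', '100');
""")

# Comparing as text sorts character by character, so '10' and '100'
# both come before '5' and fall out of the result.
as_text = conn.execute(
    "SELECT data_value FROM page_data "
    "WHERE data_value > '5' ORDER BY data_value"
).fetchall()

# The numeric answer needs an explicit conversion on every row.
as_int = conn.execute(
    "SELECT data_value FROM page_data "
    "WHERE CAST(data_value AS INTEGER) > 5 "
    "ORDER BY CAST(data_value AS INTEGER)"
).fetchall()
```

The text comparison returns only the row holding '9', while the cast returns all three. A column typed as an integer in the first place would not need the cast and could not give the first answer.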

Tidying Video Game Data

Published 2019-04-17 by Kevin Feasel

Arvid Kingl has a fun article analyzing data from an open-source video game and applying tidy data principles to it:

You will learn what key principles a tidy data set adheres to, why it is useful to follow them consequently, and how to clean the data you are given. Tidying is also a great way to get to know a new data set.
Finally, in this tutorial you will learn how to write a function that makes your analysis look much cleaner and allows you to execute repetitive elements in your analysis in a very reproducible way. The function will allow you to load the latest version of the data dynamically into a flexible data scheme, which means that large parts of the code will not have to change when new data is added.

Check it out. Bonus point: tidy data is Boyce-Codd Normal Form, which is (potentially) subsequently widened back out to include dimensional information.
