Press "Enter" to skip to content

Category: Normalization

A Mistake of “Normalization”

Hans-Jürgen Schönig makes an argument:

The concept of “normalization” is often the first thing people who are new to databases are going to learn. We are talking about one of the fundamental principles in the realm of databases. But what is the use of normalization in the first place? Well, we want to avoid redundancies in the data and make sure that information is stored in a way that helps reduce mistakes and inconsistencies. Ultimately, that is all there is to it: No redundancies, no mistakes, no inconsistencies.

There’s an example in this of “too much normalization” but I’m going to push back because this is a common misunderstanding of the idea of normalization.

The example covers removing price from an invoice table and having people look up the price from the product table, as having each price in an invoice is duplication, and we’re trying to eliminate duplication.

This argument is wrong, because it conflates two concepts. The listing price of an item is its current price. This is the thing you will see on a products table. The sale price of an item on the invoice table is a historical artifact and is not the same as the listing price, even if the dollar amounts match. Hans-Jürgen points out the consequence of making this mistake, and is correct in pointing this out. But it’s not “too much normalization” because it misunderstands the domain model and eliminating sale price from a table would remove information. Properly following the rules of normalization means you cannot lose information–that’s what each one of the normal forms does. In this case, we remove an attribute based on a faulty assumption that there is a functional dependency between product ID and sale price (that is, every time you see a specific product ID, you will always see a specific sale price). That’s the crux of the issue in this example, but the concept of normalization takes strays as a result of the faulty assumed functional dependency.

Leave a Comment

Brownfield Data Modeling

Jared Westover discusses a common trade-off:

Some decisions in life are easy, like whether to drink that second cup of coffee. But when it comes to databases, things get complicated fast. Developers often seek my input on adding tables and columns. A common question arises: Should they create a new table or expand an existing one by adding columns? This decision can be tricky because it depends on several factors, including query performance, future growth, and the complexity of implementing either solution. While adding one or two columns to an existing table may seem the easiest option, is it the best long-term solution? In this article, we look at whether it is better to add new columns versus a new table in SQL Server.

As an architectural pro-tip, when you’re looking to add a new column to an existing table, ask yourself if the new attribute you want to add actually relates to the natural key of the existing table. In Jared’s example, the natural key for video game tracker is presumably video game ID (which itself ties back to, presumably, the video game title, developer, console, and release date) and start date. Does a book actually relate to a video game and start date? No, it does not. Therefore, this book attribute does not belong on the video game tracker table.

When you dig deeper into Boyce-Codd Normal Form, you figure out that “relates to” in the prior paragraph translates to “has a functional dependency upon,” but using non-technical language for people not familiar with normalization, you can still get to the same conclusion, because ultimately, 95% of database normalization is common sense that we strenuously apply to a business domain.

And most of the time, the developer knows that this feels weird, but doesn’t want to spend the extra time doing it the best way and instead tries to do it the expedient way. This is where the role of the architect as politician comes in, and we gently guide people to the right conclusion. Or just tell them to put on their big boy britches and do it right. Either way.

Comments closed

An Overview of Normal Forms

Daniel Calbimonte talks normalization:

Various levels of normalization in SQL can be used to reduce data redundancy and have a better-structured relational data model. This tutorial looks at these various levels with explanations and examples in Microsoft SQL Server for beginners.

I disagree with part of Daniel’s explanation of 1NF: I believe that the idea of atomicity, as Daniel defines it, is not part of 1NF. I’m basing this off of CJ Date’s definition of first normal form:

Given relvar R with heading H containing attributes A1…An of types T1…Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.

All this says is that we have a single value per attribute in a tuple. “LeBron James, Lakers” and “Stephen Curry, Warriors” are perfectly reasonable values for attributes in first normal form. In Database Design and Relational Theory, Date spends a few pages covering the idea of atomicity and how there’s no good explanation for what, exactly, “atomic” means. Even in Daniel’s example, you could break down player and coach names further, not only into first and last names, but also subsets of characters within those names, like syllables. The closest thing I have for atomicity is the idea that something is atomic when it is at the lowest level given a particular set of data requirements. But that’s not a mathematical rule like the rules of normalization. It’s a business rule, and necessarily fuzzier and subjective.

That said, I like the rest of Daniel’s list and appreciate going to 5th normal form.

1 Comment

Database Normalization: Abnormal Forms

I draw the logical conclusion: the opposite of normal forms is, of course, abnormal forms:

This video covers a variety of topics, effectively wrapping up the series on normalization. We look at data warehousing, including why the Kimball-style star schema is a really bad design in theory but a perfectly reasonably design in practice. We cover the chimera of “overnormalization” and I throw out a hot take. And we finally slag on denormalization.

Click through for the video.

Comments closed

The Utility of 6th Normal Form

I have a new video:

In this video, explain what Sixth Normal Form (6NF) is and why it slots in as the third most-important normal form. We look at two separate use cases in which 6NF can make sense and I provide some guidance on when 5NF is good enough versus when 6NF is better.

6th Normal Form doesn’t necessarily make sense all the time, but there are some really good use cases for it.

Comments closed

An Overview of 4th Normal Form

I continue a series on database normalization:

In this video, [I] explain what Fourth Normal Form (4NF) is and why I consider 5NF to be significantly more important. Even so, 4NF does make it easy to explain a certain common class of problem, allowing it to provide some measure of utility.

4th Normal Form is a special case of the much more exciting 5th Normal Form, but I do have a bit of a soft spot for it.

Comments closed

Embrace the Power of 5th Normal Form

I have a new video up:

In this video, we drill into the other most important normal form, learning what Fifth Normal Form (5NF) is, why Boyce-Codd Normal Form is not enough, and examples of why 5NF can be such a challenge to implement.

Until I read CJ Date’s Database Design and Relational Theory (2nd edition), my level of appreciation for 5th Normal Form was somewhat limited, but that’s mostly because I didn’t understand it well at all. I liked the connection trap example in this article, but Date’s book was the first really good explanation of 5NF and just how powerful it is. My hope is that I was successfully able to convey that power to audiences.

Comments closed

A Primer on Boyce-Codd Normal Form

I have a new video:

In this video, we drill into one of the two most important normal forms, learning what Boyce-Codd Normal Form (BCNF) is, how you can get to BCNF, and a practical example of it. We also learn why I cast so much shade on 2nd and 3rd Normal Forms.

Boyce-Codd Normal Form is one of the two most important normal forms, and I’m pretty happy with the way this video came together to explain how you can get from 1NF into BCNF, as well as the specific benefits this provides.

Comments closed

Two Takes on First Normal Form

Joe Celko defends the honor of First Normal Form:

You do not need a complete understanding of regular expressions or ICD codes to follow this article, so don’t worry too much about it. The reason for posting the simplified regular expression was to scare you. My point was that this regular expression would be a pretty impressive CHECK constraint on this column. Shall we be honest? Despite the fact that we know the best programming practice is to detect an error as soon as possible, do you believe that the original poster wrote such a constraint for the concatenated list of ICD codes?

I’m willing to bet that any such validation is being done in an input tier by some poor lonely program, in an application language. Even more likely, it’s not being done at all.

First Normal Form (1NF) says that this concatenated string is a repeated group, and we need to replace it with a proper relational construct.

In the meantime, I’ve continued my series on database normalization and call First Normal Form overrated:

In this video, we start at the ground floor with 1st Normal Form. We’ll learn what people think it is, what it really is, and why it’s not as great as it’s cracked up to be.

I agree with Joe that his ICD-10 code example is a bad database design. The area in which I don’t agree—and for this, I’m leaning heavily on C.J. Date—is that repeating groups actually violate 1NF. My video covers this in a bit more detail and I also include a quotation from Date’s recent book on database design talking about how 1NF has nothing to do with repeating groups or atomicity, and that 1NF could even include relvars inside of relvars (an example Joe shows 1NF preventing).

Comments closed