Press "Enter" to skip to content

Category: Normalization

Sales Tables and Normalization

Hans-Jürgen Schönig lays out an argument:

When creating data models in PostgreSQL (or any other powerful relational database), several common problems can arise that hinder effective database design. One frequent issue is the incorrect normalization of tables, leading to redundant data, inefficient querying capabilities, or even simply plain wrong semantics. 

First, I agree with the argument Hans-Jürgen is making regarding modeling sales data: it is important to understand what information the company is actually collecting and the domain and business rules associated with that data collection. The real world should shape and define your database model, not the other way around.

But where I fervently disagree is with calling this “too much” or “over-optimistic” normalization, because it is not normalization at all. Normalization is the lossless decomposition of data across entities, and normal forms 1-5 break down along two major lines: decomposing functional dependencies (Boyce-Codd Normal Form) and decomposing join dependencies (4th and 5th normal forms). In the example, there is clearly a loss of information from the wider model (the table definition Hans-Jürgen shows later on), so the split into product + sales tables does not adhere to 5th normal form.

The sales table does need some variant of product_unit_price because that was the price as of the time of sale, and prices can change for a variety of reasons: discounts because the product is on sale, adjustments due to inflation, increases in demand, supply chain issues, adjustments based on new marketing models, etc. For historical purposes, it is important that we capture the price as of the time of sale, as that will tell how much to refund if the customer returns with the product. It is also important for bookkeeping reasons, as a variety of taxes and reports depend upon that number.

It’s also clear that the product table is reasonable: a product number (following the first rule of database architecture: anything labeled “number” is actually a string with non-numeric characters in it), a product name, and a product price. But that product price is the current sales price, which is different from the price at the time of sale. Those are two separate attributes, and labeling them both “product_unit_price” leads to some of that confusion. Both belong in the data model.

Then, if I’m looking critically at the data model, I would ask why product name is in the sales data. Do product names change so frequently that we need to capture them on every sales record? Do managers have the authority to change the name of a product on the fly during a sale? Probably not, and unless managers do have that discretion or the product name changes on a per-sale basis, it doesn’t belong there. There could be a functional dependency from product_number to product_name, such that each unique value of product_number has one and only one product_name. In that case, the product table should have product_name and the sales table should not.
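If you want to check whether that dependency actually holds in existing data, a quick aggregate query will do it. This is a minimal sketch, assuming the original wide sales table that still carries product_name:

-- Any product numbers returned here map to more than one name,
-- violating the proposed functional dependency
SELECT product_number
FROM sales
GROUP BY product_number
HAVING COUNT(DISTINCT product_name) > 1;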

But what if the product name does change over time and we need to track those changes? Then I’d (potentially) relabel product_name on the product table to current_product_name and create a 6th normal form table called something like product_name_history. This table would include the product_number, the product_name, and a time range over which that name was valid for that product. These timeframes should not overlap and, as long as the product exists, there should be an entry that corresponds with the appropriate name at any arbitrary time.

My final model would look something like:

CREATE TABLE product (
    product_number text PRIMARY KEY,
    product_name text NOT NULL,
    product_price numeric(10, 2) NOT NULL
);

CREATE TABLE product_name_history (
    -- Primary key would be on the combo of product_number and begin_time/interval
    product_number text NOT NULL,
    product_name text NOT NULL,
    -- Architected as an interval, implemented as start and end timestamps
    begin_time timestamptz NOT NULL,
    end_time timestamptz NOT NULL
);

CREATE TABLE sales (
    sales_time timestamptz NOT NULL,
    product_number text NOT NULL,
    sale_unit_price numeric(10, 2) NOT NULL,
    sale_units numeric(10, 2) NOT NULL
);

I did also rename “product_unit_price” and “product_units” to “sale_unit_price” and “sale_units” to make it clearer that these are associated with the sale itself. We could also get into things like sales orders versus sales order line items, additional attributes, etc., but this post is already long enough as it is. And again, this assumes that the product name changes frequently enough, and that the different names are relevant enough, for us to care about what the product name was at the time of a given sale. If not, we could remove the product_name_history table without otherwise changing the model.
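As a side note, PostgreSQL can enforce the non-overlap rule on product_name_history declaratively. This is a sketch rather than part of the model above, and it assumes the btree_gist extension (which ships with PostgreSQL but must be enabled):

CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE product_name_history
    ADD CONSTRAINT product_name_history_no_overlap
    -- Reject any two rows for the same product whose validity ranges overlap
    EXCLUDE USING gist (
        product_number WITH =,
        tstzrange(begin_time, end_time) WITH &&
    );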

Long story short, I agree with Hans-Jürgen that database design is important and that understanding the business domain is critical for designing a good data model. But where I disagree is with the concept of over-normalization. That’s not a real thing. Improper attempts at normalization are certainly a real thing, as we see in this post, and it is important to understand the business rules, as well as the rules defining the normal forms, in order to prevent this.


Multi-Measure Calculations in Relational Databases

Greg Low describes a common business problem:

But while food wholesale systems will need to deal with quantities like I described in that post, they often have another layer of complexity. Items are often sold by:

  • Quantity
  • Weight
  • Quantity and Weight

This is an interesting look at how the domain can drive what a proper solution looks like. It also seems like a good use case for 6th normal form, with unit quantity and unit weight tables to prevent NULL from cropping up.
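As a rough sketch of what that might look like (the table and column names here are my own invention, not Greg’s):

CREATE TABLE item (
    item_number text PRIMARY KEY,
    item_name text NOT NULL
);

-- 6NF-style satellite tables: a row exists only when the measure applies,
-- so an item sold by weight alone never produces a NULL unit quantity
CREATE TABLE item_unit_quantity (
    item_number text PRIMARY KEY REFERENCES item (item_number),
    unit_quantity numeric(10, 2) NOT NULL
);

CREATE TABLE item_unit_weight (
    item_number text PRIMARY KEY REFERENCES item (item_number),
    unit_weight numeric(10, 3) NOT NULL
);

An item sold by both quantity and weight simply gets a row in each satellite table.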


A Mistake of “Normalization”

Hans-Jürgen Schönig makes an argument:

The concept of “normalization” is often the first thing people who are new to databases are going to learn. We are talking about one of the fundamental principles in the realm of databases. But what is the use of normalization in the first place? Well, we want to avoid redundancies in the data and make sure that information is stored in a way that helps reduce mistakes and inconsistencies. Ultimately, that is all there is to it: No redundancies, no mistakes, no inconsistencies.

There’s an example in this post of “too much normalization,” but I’m going to push back, because it rests on a common misunderstanding of the idea of normalization.

The example covers removing price from an invoice table and having people look up the price from the product table, as having each price in an invoice is duplication, and we’re trying to eliminate duplication.

This argument is wrong because it conflates two concepts. The listing price of an item is its current price; this is the thing you will see on a products table. The sale price of an item on the invoice table is a historical artifact and is not the same as the listing price, even if the dollar amounts match. Hans-Jürgen correctly points out the consequence of making this mistake. But the mistake is not “too much normalization”: the example misunderstands the domain model, and eliminating sale price from the invoice table would remove information. Properly following the rules of normalization means you cannot lose information; that is what each of the normal forms guarantees. In this case, we remove an attribute based on the faulty assumption that there is a functional dependency between product ID and sale price (that is, every time you see a specific product ID, you will always see a specific sale price). That faulty assumed functional dependency is the crux of the issue in this example, and the concept of normalization takes strays as a result.
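To make the distinction concrete, here is a minimal sketch (my own table and column names, not ones from the original post):

CREATE TABLE product (
    product_id int PRIMARY KEY,
    -- The current listing price, which can change at any time
    list_price numeric(10, 2) NOT NULL
);

CREATE TABLE invoice_line (
    invoice_id int NOT NULL,
    product_id int NOT NULL REFERENCES product (product_id),
    -- The price as of the time of sale: a historical fact, not a lookup
    sale_price numeric(10, 2) NOT NULL,
    PRIMARY KEY (invoice_id, product_id)
);

Dropping sale_price and joining to product.list_price at query time would silently rewrite history every time the listing price changes.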


Brownfield Data Modeling

Jared Westover discusses a common trade-off:

Some decisions in life are easy, like whether to drink that second cup of coffee. But when it comes to databases, things get complicated fast. Developers often seek my input on adding tables and columns. A common question arises: Should they create a new table or expand an existing one by adding columns? This decision can be tricky because it depends on several factors, including query performance, future growth, and the complexity of implementing either solution. While adding one or two columns to an existing table may seem the easiest option, is it the best long-term solution? In this article, we look at whether it is better to add new columns versus a new table in SQL Server.

As an architectural pro-tip, when you’re looking to add a new column to an existing table, ask yourself whether the new attribute actually relates to the natural key of the existing table. In Jared’s example, the natural key for the video game tracker is presumably video game ID (which itself ties back to the video game title, developer, console, and release date) and start date. Does a book actually relate to a video game and a start date? No, it does not. Therefore, the book attribute does not belong on the video game tracker table.
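In code form, that reasoning might look something like this (hypothetical table names, loosely based on Jared’s example):

-- Every attribute here should depend on the natural key: (video_game_id, start_date)
CREATE TABLE video_game_tracker (
    video_game_id int NOT NULL,
    start_date date NOT NULL,
    hours_played numeric(6, 1) NULL,
    PRIMARY KEY (video_game_id, start_date)
);

-- A book does not depend on a video game and its start date,
-- so it gets its own table rather than extra columns on the tracker above
CREATE TABLE book_tracker (
    book_id int NOT NULL,
    start_date date NOT NULL,
    PRIMARY KEY (book_id, start_date)
);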

When you dig deeper into Boyce-Codd Normal Form, you figure out that “relates to” in the prior paragraph translates to “has a functional dependency upon.” But even using non-technical language for people unfamiliar with normalization, you can still reach the same conclusion, because ultimately, 95% of database normalization is common sense that we strenuously apply to a business domain.

And most of the time, the developer knows that this feels weird, but doesn’t want to spend the extra time doing it the best way and instead tries to do it the expedient way. This is where the role of the architect as politician comes in, and we gently guide people to the right conclusion. Or just tell them to put on their big boy britches and do it right. Either way.


An Overview of Normal Forms

Daniel Calbimonte talks normalization:

Various levels of normalization in SQL can be used to reduce data redundancy and have a better-structured relational data model. This tutorial looks at these various levels with explanations and examples in Microsoft SQL Server for beginners.

I disagree with part of Daniel’s explanation of 1NF: I believe that the idea of atomicity, as Daniel defines it, is not part of 1NF. I’m basing this off of CJ Date’s definition of first normal form:

Given relvar R with heading H containing attributes A1…An of types T1…Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.

All this says is that we have a single value per attribute in a tuple. “LeBron James, Lakers” and “Stephen Curry, Warriors” are perfectly reasonable values for attributes in first normal form. In Database Design and Relational Theory, Date spends a few pages covering the idea of atomicity and how there’s no good explanation for what, exactly, “atomic” means. Even in Daniel’s example, you could break down player and coach names further, not only into first and last names, but also into subsets of characters within those names, like syllables. The closest thing I have to a definition of atomicity is the idea that something is atomic when it is at the lowest level given a particular set of data requirements. But that’s not a mathematical rule like the rules of normalization. It’s a business rule, and necessarily fuzzier and more subjective.

That said, I like the rest of Daniel’s list and appreciate going to 5th normal form.


Database Normalization: Abnormal Forms

I draw the logical conclusion: the opposite of normal forms is, of course, abnormal forms:

This video covers a variety of topics, effectively wrapping up the series on normalization. We look at data warehousing, including why the Kimball-style star schema is a really bad design in theory but a perfectly reasonable design in practice. We cover the chimera of “overnormalization” and I throw out a hot take. And we finally slag on denormalization.

Click through for the video.


The Utility of 6th Normal Form

I have a new video:

In this video, [I] explain what Sixth Normal Form (6NF) is and why it slots in as the third most-important normal form. We look at two separate use cases in which 6NF can make sense and I provide some guidance on when 5NF is good enough versus when 6NF is better.

6th Normal Form doesn’t necessarily make sense all the time, but there are some really good use cases for it.


An Overview of 4th Normal Form

I continue a series on database normalization:

In this video, [I] explain what Fourth Normal Form (4NF) is and why I consider 5NF to be significantly more important. Even so, 4NF does make it easy to explain a certain common class of problem, allowing it to provide some measure of utility.

4th Normal Form is a special case of the much more exciting 5th Normal Form, but I do have a bit of a soft spot for it.


Embrace the Power of 5th Normal Form

I have a new video up:

In this video, we drill into the other most important normal form, learning what Fifth Normal Form (5NF) is, why Boyce-Codd Normal Form is not enough, and examples of why 5NF can be such a challenge to implement.

Until I read CJ Date’s Database Design and Relational Theory (2nd edition), my level of appreciation for 5th Normal Form was somewhat limited, but that’s mostly because I didn’t understand it well at all. I liked the connection trap example in this article, but Date’s book was the first really good explanation of 5NF and just how powerful it is. My hope is that I was successfully able to convey that power to audiences.
