
Category: Normalization

Sixth Normal Form to Avoid NULLs

I have a response to an article defending the pragmatic use of NULLs:

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first; it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

One explicit assumption in here is that you don’t really have a large number of nullable (or NULLable, as long-form blogging me wants to write) columns on a given table. Using 6NF-style tables for nullable attributes is a lot less tenable when you have 15 or 20 distinct nullable columns on a table, but at that point I have to ask: is your data model actually correct if you have that many missing attributes?
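
As a minimal sketch of the 6NF-style approach (with hypothetical table and column names of my own), the nullable attribute moves to a narrow side table which only has rows for the entities that actually have a value:

    CREATE TABLE dbo.Customer
    (
        CustomerID INT NOT NULL PRIMARY KEY,
        FirstName NVARCHAR(50) NOT NULL,
        LastName NVARCHAR(50) NOT NULL
    );

    -- One narrow table per optional attribute, so no column is ever NULL.
    CREATE TABLE dbo.CustomerMiddleName
    (
        CustomerID INT NOT NULL PRIMARY KEY
            REFERENCES dbo.Customer (CustomerID),
        MiddleName NVARCHAR(50) NOT NULL
    );

    -- Reading the optional attribute becomes an outer join rather than a NULL check.
    SELECT c.CustomerID, c.FirstName, c.LastName, m.MiddleName
    FROM dbo.Customer c
        LEFT OUTER JOIN dbo.CustomerMiddleName m
            ON m.CustomerID = c.CustomerID;

Each optional attribute costs another table and another join, which is exactly why the approach strains once you get to 15 or 20 of them.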


Indicators of Schema Issues

Erik Darling has a good list of schema-related issues:

Something is broken in the way that you store data.

You’re overloading things, and you’re going to hit big performance problems when your database grows past puberty.

Most of what he’s describing in this post is a failure of atomicity, which implies a failure to achieve first normal form. Mind you, all of these functions are perfectly reasonable as part of data loading, and many of them are perfectly reasonable in the SELECT clause of a query (though that’s still a sign of failure of atomicity), but once you start throwing them into the WHERE clause, we’ve got problems.
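
As a hypothetical illustration of the WHERE clause problem (the table and column names are mine, not Erik's), once several values get packed into one column, the predicate needs string surgery and the optimizer loses the ability to seek an index:

    -- Non-atomic storage: PhoneNumbers holds a delimited list of numbers.
    SELECT c.CustomerID
    FROM dbo.Customer c
    WHERE CHARINDEX('555-0199', c.PhoneNumbers) > 0;

    -- Atomic storage: one phone number per row, and the predicate is a plain seek.
    SELECT cp.CustomerID
    FROM dbo.CustomerPhone cp
    WHERE cp.PhoneNumber = '555-0199';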


Normalization and Reduced Blocking

Erik Darling points out one of the many benefits of normalizing tables in a database:

Looking at the design, there are two big problems:

1. There are “order” columns that are going to get a lot of inserts and updates
2. You’re going to be storing the same customer information over and over again

The more related, but not independent, data you store in the same table, the harder it becomes to effectively index that table.

My take on this is that the old adage of “Normalize until it hurts; denormalize until it works” hasn’t been operative since the SSD era began roughly 15 years ago.
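
As a rough sketch of the split Erik is driving at (the DDL and column names are my own invention), the volatile order columns move to their own table and the customer details get stored exactly once:

    CREATE TABLE dbo.Customers
    (
        CustomerID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerName NVARCHAR(100) NOT NULL,
        EmailAddress NVARCHAR(320) NOT NULL
    );

    CREATE TABLE dbo.Orders
    (
        OrderID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerID INT NOT NULL REFERENCES dbo.Customers (CustomerID),
        OrderStatus TINYINT NOT NULL,
        OrderDate DATETIME2(0) NOT NULL
    );

Inserts and status updates now touch narrow rows in dbo.Orders rather than wide rows that also carry customer details, which is where the reduced blocking comes from.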


Database Normal Forms

Joe Celko walks us through key and less-key normal forms:

Even before RDBMS, we had network and hierarchical databases. Their first goal was to remove redundancy. We want to store one fact, one way, one place, and one time. Normalization goes a step further. The goal of normalization is to prevent anomalies in the data. An anomaly could be an insertion anomaly, update anomaly, or deletion anomaly. This means that doing one of those basic operations destroys a fact or creates a falsehood.

It’s an interesting read on a sadly-neglected topic.
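
To make the anomaly types concrete (a throwaway example of my own, not one from the article), imagine a single table in which customer details repeat on every order row:

    -- Update anomaly: this fixes one row, but every other order for the same
    -- customer still carries the old name, so the table now asserts two
    -- conflicting facts about one customer.
    UPDATE dbo.OrdersWithCustomerInfo
    SET CustomerName = N'Jane Smythe'
    WHERE OrderID = 42;

    -- Deletion anomaly: removing a customer's only order also erases the fact
    -- that the customer exists at all.
    DELETE FROM dbo.OrdersWithCustomerInfo
    WHERE OrderID = 42;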


Data Type Conversions in Predicates

Bert Wagner takes us through a troublesome table design:

This table stores data for an application that has many different types of Pages. Each Page stores different types of data, but instead of creating a separate table for each type, we store all the different data in the varchar DataValue column and maintain the original data type in the DataType column.

This structure reduces the complexity required for maintaining our database (compared to creating possibly hundreds of tables, one for each PageName) and makes querying easier (only need to query one table). However, this design could also lead to some unexpected query results.

This is your daily reminder that an attribute should be a thing which describes an entity, not one of multiple things.
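
As a sketch of the failure mode (the table here is my approximation of the shape Bert describes), filtering the varchar DataValue column as though it were numeric can fail even with a DataType filter in place, because SQL Server makes no promise about which predicate it evaluates first:

    -- Risky: the CONVERT may run against rows whose DataValue holds a date or
    -- free text, throwing a conversion error before the DataType filter removes them.
    SELECT p.PageName, p.DataValue
    FROM dbo.PageData p
    WHERE p.DataType = 'INT'
        AND CONVERT(INT, p.DataValue) > 100;

    -- Safer, though it still has to convert and scan every row.
    SELECT p.PageName, p.DataValue
    FROM dbo.PageData p
    WHERE TRY_CONVERT(INT, p.DataValue) > 100;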


Tidying Video Game Data

Arvid Kingl has a fun article analyzing data from an open-source video game and applying tidy data principles to it:

You will learn what key principles a tidy data set adheres to, why it is useful to follow them consistently, and how to clean the data you are given. Tidying is also a great way to get to know a new data set.

Finally, in this tutorial you will learn how to write a function that makes your analysis look much cleaner and allows you to execute repetitive elements in your analysis in a very reproducible way. The function will allow you to load the latest version of the data dynamically into a flexible data scheme, which means that large parts of the code will not have to change when new data is added.

Check it out. Bonus point: tidy data is Boyce-Codd Normal Form, which is (potentially) subsequently widened back out to include dimensional information.


Normalization Code Smells

Erik Darling has a few tips to see if you’re not properly normalizing your tables:

First Sign Of Problems: Prefixed Columns

Do you have columns with similar prefixes?

If you have naming patterns like this, it’s time to look at splitting those columns out.

I took the Users and Posts tables from Stack Overflow and mangled them a bit to look like this.

You may not have tables with this explicit arrangement, but it could be implied all over the place.

One great way to tell is to look at your indexes. If certain groups of columns are always indexed together, or if there are lots of missing index requests for certain groups of columns, it may be time to look at splitting them out into different tables.

These things tend to happen, but they can have serious negative consequences for database performance, not to mention the risk of bad data sneaking in.
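
A hypothetical before-and-after for the prefixed-column smell (the badge columns are my stand-ins, not the exact mangling from the post):

    -- Before: dbo.Users (UserID, DisplayName, Badge_Name, Badge_Date, Badge_Class, ...)
    -- The Badge_ prefix is a second entity hiding inside the Users table.

    -- After: the implied entity gets its own table, keyed back to Users.
    CREATE TABLE dbo.UserBadges
    (
        UserBadgeID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        UserID INT NOT NULL REFERENCES dbo.Users (UserID),
        BadgeName NVARCHAR(50) NOT NULL,
        BadgeDate DATETIME2(0) NOT NULL,
        BadgeClass TINYINT NOT NULL
    );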


Normalizing To Boyce-Codd Normal Form

I am a big fan of Boyce-Codd Normal Form:

Boyce-Codd Normal Form is a generalization of Second and Third Normal Forms.  There are a couple of requirements to be in Boyce-Codd Normal Form.  First, your table must be in First Normal Form.  This means that:

  • Every entity (row) has a consistent shape.  This is something relational databases do for you automatically:  you can’t create a table where one entity has an attribute (column) but the next entity doesn’t.
  • Every entity has a unique value.  You can uniquely identify any particular row.
  • Every attribute is atomic:  you don’t try to pack more than one value into a single attribute.
  • There are no repeating groups of attributes, like PaymentMethod1, PaymentMethod2, PaymentMethod3, etc.

The other half of BCNF is that every determinant on an entity is a key.

Also click through for an iterative, easy-to-follow process to get to BCNF.
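
As a quick, made-up illustration of the determinant rule (not the example from the post): suppose dbo.Enrollment has the key {StudentID, CourseID} but also stores both InstructorID and InstructorName. InstructorID determines InstructorName without being a key of the table, so the table is not in BCNF. The fix is to move the dependent attribute to a table where its determinant is the key:

    CREATE TABLE dbo.Instructor
    (
        InstructorID INT NOT NULL PRIMARY KEY,
        InstructorName NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.Enrollment
    (
        StudentID INT NOT NULL,
        CourseID INT NOT NULL,
        InstructorID INT NOT NULL REFERENCES dbo.Instructor (InstructorID),
        CONSTRAINT PK_Enrollment PRIMARY KEY (StudentID, CourseID)
    );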


Tidy Data Is Normalized Data

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure:  we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.


Normalize

I walk through a scenario which underscores the importance of normalization:

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.

The implicit story here is that you can find someone who knows how to use tally tables, how to concatenate strings of data (quickly!), and how to tie the various pieces of the puzzle together…or you can design the database the right way and avoid this pain later.
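
For reference, here is a stripped-down version of the FOR XML PATH concatenation trick described above, using stand-in table and column names rather than the ones from the post:

    SELECT
        sv.CustomerID,
        sv.CampaignID,
        STUFF(
            (
                SELECT ',' + sv2.SplitVersion
                FROM dbo.SplitVersions sv2
                WHERE sv2.CustomerID = sv.CustomerID
                    AND sv2.CampaignID = sv.CampaignID
                FOR XML PATH('')
            ), 1, 1, '') AS Versions
    FROM dbo.SplitVersions sv
    GROUP BY sv.CustomerID, sv.CampaignID;

On SQL Server 2017 and later, STRING_AGG(SplitVersion, ',') does the same job without the XML ceremony, though the broader point stands either way: the right design makes the trick unnecessary.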
