Normalization Code Smells

Erik Darling has a few tips to see if you’re not properly normalizing your tables:

First Sign Of Problems: Prefixed Columns

Do you have columns with similar prefixes?

If you have naming patterns like this, it’s time to look at splitting those columns out.

I took the Users and Posts tables from Stack Overflow and mangled them a bit to look like this.

You may not have tables with this explicit arrangement, but it could be implied all over the place.

One great way to tell is to look at your indexes. If certain groups of columns are always indexed together, or if there are lots of missing index requests for certain groups of columns, it may be time to look at splitting them out into different tables.

These things tend to happen, but they can have serious negative consequences for database performance, not to mention the risk of bad data sneaking in.

Normalizing To Boyce-Codd Normal Form

I am a big fan of Boyce-Codd Normal Form:

Boyce-Codd Normal Form is a generalization of Second and Third Normal Forms.  There are a couple of requirements to be in Boyce-Codd Normal Form.  First, your table must be in First Normal Form.  This means that:

  • Every entity (row) has a consistent shape.  This is something relational databases do for you automatically:  you can’t create a table where one entity has an attribute (column) but the next entity doesn’t.
  • Every entity has a unique value.  You can uniquely identify any particular row.
  • Every attribute is atomic:  you don’t try to pack more than one value into a single attribute.
  • There are no repeating groups of attributes, like PaymentMethod1, PaymentMethod2, PaymentMethod3, etc.

The other half of BCNF is that every determinant on an entity is a key.

Also click through for an iterative, easy-to-follow process to get to BCNF.

Tidy Data Is Normalized Data

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure:  we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.

Normalize

I walk through a scenario which underscores the importance of normalization:

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.

The implicit story here is, you can find someone who knows how to use tally tables, how to concatenate strings of data (quickly!), who knows how to tie various pieces of the puzzle together, and so on…or design the database the right way and avoid this pain later.

Anchor Modeling

Steph Locke has a presentation on Anchor Modeling as 6th Normal Form:

Anchor Modelling moves you beyond third normal form and into sixth normal form. What does this mean? Essentially it means that an attribute is stored independently against the key, not in a big table with other attributes. This means you can easily store metadata about that attribute and do full change tracking with ease. The historical problem with this methodology is that it makes writing queries a real pain. Anchor Modelling overcomes this by providing views that combine all the attribute data together.

Anchor Modeling is a rather different approach, so if it sounds interesting, check out the tutorial.

Categories

December 2018
MTWTFSS
« Nov  
 12
3456789
10111213141516
17181920212223
24252627282930
31