You will learn what key principles a tidy data set adheres to, why it is useful to follow them consistently, and how to clean the data you are given. Tidying is also a great way to get to know a new data set.
Finally, in this tutorial you will learn how to write a function that makes your analysis look much cleaner and lets you execute repetitive steps in a reproducible way. The function will allow you to load the latest version of the data dynamically into a flexible data structure, which means that large parts of the code will not have to change when new data is added.
Check it out. Bonus point: tidy data is essentially Boyce-Codd Normal Form, which you then (potentially) widen back out to include dimensional information.
First Sign Of Problems: Prefixed Columns
Do you have columns with similar prefixes?
I took the Users and Posts tables from Stack Overflow and mangled them a bit to look like this.
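Something along these lines, as a purely hypothetical sketch with invented column names rather than the actual mangled tables:

```sql
-- A hypothetical sketch of the smell: one table carrying two prefixed column
-- groups (User_* and Post_*) that really describe two different entities.
CREATE TABLE dbo.UserPost
(
    User_Id int NOT NULL,
    User_DisplayName nvarchar(40) NOT NULL,
    User_Reputation int NOT NULL,
    Post_Id int NOT NULL,
    Post_Title nvarchar(250) NULL,
    Post_Score int NOT NULL
);
```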
You may not have tables with this explicit arrangement, but it could be implied all over the place.
One great way to tell is to look at your indexes. If certain groups of columns are always indexed together, or if there are lots of missing index requests for certain groups of columns, it may be time to look at splitting them out into different tables.
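On SQL Server, the missing-index DMVs are one place to look. A sketch of that kind of check (treat missing-index suggestions as hints, not gospel):

```sql
-- Surface column groups the optimizer repeatedly wants indexed together.
SELECT
    mid.statement AS TableName,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details mid
    INNER JOIN sys.dm_db_missing_index_groups mig
        ON mig.index_handle = mid.index_handle
    INNER JOIN sys.dm_db_missing_index_group_stats migs
        ON migs.group_handle = mig.index_group_handle
ORDER BY
    migs.user_seeks DESC;
```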
These things tend to happen, but they can have serious negative consequences for database performance, not to mention the risk of bad data sneaking in.
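To make the fix concrete, splitting the prefixed groups out into their own tables might look roughly like this (again a hypothetical sketch, loosely modeled on the Stack Overflow schema):

```sql
-- Each prefixed group becomes its own table with its own key, and the
-- relationship is expressed as a foreign key instead of repeated columns.
CREATE TABLE dbo.Users
(
    Id int NOT NULL PRIMARY KEY,
    DisplayName nvarchar(40) NOT NULL,
    Reputation int NOT NULL
);

CREATE TABLE dbo.Posts
(
    Id int NOT NULL PRIMARY KEY,
    Title nvarchar(250) NULL,
    Score int NOT NULL,
    OwnerUserId int NOT NULL REFERENCES dbo.Users (Id)
);
```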
Boyce-Codd Normal Form is a generalization of Second and Third Normal Forms. There are a couple of requirements to be in Boyce-Codd Normal Form. First, your table must be in First Normal Form. This means that:
- Every entity (row) has a consistent shape. This is something relational databases do for you automatically: you can’t create a table where one entity has an attribute (column) but the next entity doesn’t.
- Every entity is unique: you can uniquely identify any particular row.
- Every attribute is atomic: you don’t try to pack more than one value into a single attribute.
- There are no repeating groups of attributes, like PaymentMethod1, PaymentMethod2, PaymentMethod3, etc.
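For that last case, the standard fix is to turn the repeating columns into rows in a child table; a minimal sketch with hypothetical names:

```sql
-- One row per customer per payment method, instead of PaymentMethod1..3 columns.
CREATE TABLE dbo.CustomerPaymentMethod
(
    CustomerID int NOT NULL,
    PaymentMethod nvarchar(30) NOT NULL,
    PRIMARY KEY (CustomerID, PaymentMethod)
);
```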
The other half of BCNF is that every determinant on an entity is a key: whenever one attribute (or set of attributes) determines the value of another attribute, that determinant must itself be a candidate key.
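To illustrate the determinant rule with the textbook example (mine, not the linked post’s): suppose each instructor teaches exactly one course. Then Instructor determines CourseID, but Instructor is not a key of the enrollment table, so the table is not in BCNF. The decomposition makes every determinant a key:

```sql
-- Violates BCNF: the key is (StudentID, CourseID), yet Instructor -> CourseID.
CREATE TABLE dbo.Enrollment
(
    StudentID int NOT NULL,
    CourseID int NOT NULL,
    Instructor nvarchar(60) NOT NULL,
    PRIMARY KEY (StudentID, CourseID)
);

-- BCNF decomposition: Instructor is now a key wherever it acts as a determinant.
CREATE TABLE dbo.InstructorCourse
(
    Instructor nvarchar(60) NOT NULL PRIMARY KEY,
    CourseID int NOT NULL
);

CREATE TABLE dbo.StudentInstructor
(
    StudentID int NOT NULL,
    Instructor nvarchar(60) NOT NULL REFERENCES dbo.InstructorCourse (Instructor),
    PRIMARY KEY (StudentID, Instructor)
);
```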
Also click through for an iterative, easy-to-follow process to get to BCNF.
The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data. Specifically, Wickham argues that tidy data should achieve third normal form.
Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together. But the way we denormalize still retains a fairly normal structure: we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations into the same row or multiple variables into the same column, we don’t reuse a column for multiple purposes, and so on.
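In database terms, that kind of analytics-friendly denormalization often amounts to a flat view over normalized tables. A hedged sketch using the hypothetical Users and Posts tables from earlier, where each row is still one observation (a post) and each column is still one variable:

```sql
-- Flattened for analysis, but still one observation per row and one variable
-- per column.
CREATE VIEW dbo.PostsForAnalysis AS
SELECT
    p.Id AS PostId,
    p.Title,
    p.Score,
    u.DisplayName AS OwnerDisplayName,
    u.Reputation AS OwnerReputation
FROM dbo.Posts p
    INNER JOIN dbo.Users u
        ON u.Id = p.OwnerUserId;
```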
I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.
This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without their record separators). I then retain only the positions which start a sequence and use SUBSTRING to snatch up four digits. What I’m left with is a result set with a column named SplitVersion: one row for each combination of customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).
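The original code isn’t reproduced here, but a simplified sketch of the idea, with hypothetical table and column names, would look something like this:

```sql
-- Split packed 4-digit values back out into one row per value, assuming a
-- numbers table dbo.Tally(N) starting at 1 and a hypothetical source table
-- dbo.CustomerCampaign(CustomerID, CampaignID, RemovedValue).
SELECT
    cc.CustomerID,
    cc.CampaignID,
    SUBSTRING(cc.RemovedValue, t.N, 4) AS SplitVersion
FROM dbo.CustomerCampaign cc
    INNER JOIN dbo.Tally t
        ON t.N <= LEN(cc.RemovedValue)
WHERE
    t.N % 4 = 1;  -- keep only the positions which start a 4-digit sequence
```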
If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.
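And a hedged sketch of the re-concatenation step (the real query lives in the tallyjoin piece of the original; the names below are stand-ins):

```sql
-- Re-concatenate the split values per customer and campaign with the
-- FOR XML PATH('') trick: build a comma-separated string, strip the leading
-- comma with STUFF, then swap the commas for CHAR(30) at the end.
-- SplitVersions stands in for the output of the split step above.
SELECT
    cc.CustomerID,
    cc.CampaignID,
    REPLACE(
        STUFF(
            (
                SELECT ',' + sv.SplitVersion
                FROM SplitVersions sv
                WHERE sv.CustomerID = cc.CustomerID
                    AND sv.CampaignID = cc.CampaignID
                FOR XML PATH('')
            ),
            1, 1, ''),
        ',', CHAR(30)) AS RemovedValue
FROM
    (SELECT DISTINCT CustomerID, CampaignID FROM SplitVersions) cc;
```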
The implicit story here is, you can find someone who knows how to use tally tables, how to concatenate strings of data (quickly!), who knows how to tie various pieces of the puzzle together, and so on…or design the database the right way and avoid this pain later.
Anchor Modelling moves you beyond third normal form and into sixth normal form. What does this mean? Essentially, it means that each attribute is stored independently against the key, rather than in one big table alongside other attributes. This means you can easily store metadata about that attribute and do full change tracking with ease. The historical problem with this methodology is that it makes writing queries a real pain. Anchor Modelling overcomes this by providing views that combine all the attribute data together.
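Anchor Modelling generates its own schema and views, but the flavor of sixth normal form it produces looks roughly like this hand-rolled sketch (hypothetical names, far simpler than what the tool actually emits):

```sql
-- One anchor table for the key, one table per attribute (each carrying its
-- own change-tracking metadata), and a view that stitches current values
-- back together so queries stay simple.
CREATE TABLE dbo.CustomerAnchor
(
    CustomerID int NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.CustomerName
(
    CustomerID int NOT NULL REFERENCES dbo.CustomerAnchor (CustomerID),
    CustomerName nvarchar(100) NOT NULL,
    ValidFrom datetime2(0) NOT NULL,
    PRIMARY KEY (CustomerID, ValidFrom)
);

CREATE TABLE dbo.CustomerEmail
(
    CustomerID int NOT NULL REFERENCES dbo.CustomerAnchor (CustomerID),
    EmailAddress nvarchar(256) NOT NULL,
    ValidFrom datetime2(0) NOT NULL,
    PRIMARY KEY (CustomerID, ValidFrom)
);
GO

-- The view hides the joins, so queries read like they would against one table.
CREATE VIEW dbo.CurrentCustomer AS
SELECT
    a.CustomerID,
    n.CustomerName,
    e.EmailAddress
FROM dbo.CustomerAnchor a
    OUTER APPLY
    (
        SELECT TOP (1) cn.CustomerName
        FROM dbo.CustomerName cn
        WHERE cn.CustomerID = a.CustomerID
        ORDER BY cn.ValidFrom DESC
    ) n
    OUTER APPLY
    (
        SELECT TOP (1) ce.EmailAddress
        FROM dbo.CustomerEmail ce
        WHERE ce.CustomerID = a.CustomerID
        ORDER BY ce.ValidFrom DESC
    ) e;
```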