Defining Tidy Data

John Mount shares thoughts about the concept of tidy data:

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.

My spin on it is that tidy data starts out in Boyce-Codd Normal Form but may subsequently be denormalized. Denormalizing may reintroduce violations of 3NF (as in Mount’s example) and sometimes 2NF, but it does not change the shape of the variables themselves: a variable still represents a single thing and exists per observation.
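
To make the distinction concrete, here is a minimal sketch in plain Python. The player names, tournaments, and dates are invented for illustration and are not taken from Mount's table; the helper names denormalize and dependency_holds are likewise hypothetical. It keeps the winner's date of birth in its own table, joins it back onto the results to produce the wide, denormalized form, and then checks whether the "same winner, same date of birth" invariant still holds in that wide form.

```python
# Hypothetical data: every name, tournament, and date below is invented.

# One row per winner: date of birth is a fact about the winner alone.
winners = {
    "Player A": "1970-01-01",
    "Player B": "1982-06-15",
}

# One row per (tournament, year) key: who won that event.
results = [
    {"tournament": "Open X", "year": 2001, "winner": "Player A"},
    {"tournament": "Open X", "year": 2002, "winner": "Player B"},
    {"tournament": "Open Y", "year": 2002, "winner": "Player A"},
]


def denormalize(results, winners):
    """Join date of birth onto each result row.

    The output repeats a winner's date of birth on every row they appear in,
    which is the transitive dependency that 3NF rules out.
    """
    return [{**row, "winner_dob": winners[row["winner"]]} for row in results]


def dependency_holds(rows, determinant, dependent):
    """Return True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        # setdefault stores the first value seen and returns it afterwards.
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True


wide = denormalize(results, winners)
```

In the normalized pair of tables the invariant cannot be broken, because each winner has exactly one date of birth entry. In the wide table nothing enforces it, so a check like dependency_holds(wide, "winner", "winner_dob") has to be run explicitly. Each column still holds one variable and each row one observation, which is the sense in which the wide table keeps its tidy shape.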