Defining Tidy Data

Kevin Feasel



John Mount shares thoughts about the concept of tidy data:

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.

My spin on it is that tidy data starts out in Boyce-Codd Normal Form but may subsequently be denormalized. This may reintroduce violations of 3NF (as in Mount’s example) and sometimes 2NF, but does not change the shape of the variables themselves: a variable still represents a single thing and exists per observation.
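To make the distinction concrete, here is a minimal Python sketch of the idea, not anything taken from Mount's post. The column names follow the quoted example; the tournament names, years, dates of birth, and the helper names `fd_violations`, `normalize`, and `denormalize` are all assumptions made up for illustration. It splits the wide table into a normalized pair, joins it back, and checks that a winner always maps to one date of birth.

```python
from collections import defaultdict

# Illustrative denormalized rows keyed by (tournament, year).
# Only the column names and "Al Fredrickson" come from the quoted example;
# every other value here is hypothetical.
rows = [
    {"tournament": "Tournament A", "year": 1998,
     "winner": "Al Fredrickson", "winner_dob": "1975-07-21"},
    {"tournament": "Tournament B", "year": 1999,
     "winner": "Bob Example", "winner_dob": "1968-09-28"},
    {"tournament": "Tournament C", "year": 1999,
     "winner": "Al Fredrickson", "winner_dob": "1975-07-21"},
]


def fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


def normalize(rows):
    """Split the wide table so date of birth is stored once per winner."""
    wins = [
        {"tournament": r["tournament"], "year": r["year"], "winner": r["winner"]}
        for r in rows
    ]
    # If the input already disagrees on a date of birth, the last row wins;
    # run fd_violations first to detect that case.
    births = {r["winner"]: r["winner_dob"] for r in rows}
    return wins, births


def denormalize(wins, births):
    """Join the two tables back into one row per (tournament, year)."""
    return [{**w, "winner_dob": births[w["winner"]]} for w in wins]


wins, births = normalize(rows)
wide_again = denormalize(wins, births)
print(wide_again == rows)
print(fd_violations(wide_again, "winner", "winner_dob"))
```

The denormalized result still has one row per observation and one column per variable, which is the sense in which I would call it tidy. What it loses is the structural guarantee: nothing in the wide table stops two rows from disagreeing about a winner's date of birth, so a check like `fd_violations` has to stand in for the constraint.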


