Defining Tidy Data

John Mount shares thoughts about the concept of tidy data:

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no the above table is not “tidy” under that paper’s definition. The the winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.

My spin on it is that tidy data is Boyce-Codd Normal Form but may subsequently be denormalized. This may reintroduce violations of 3NF (as in Mount’s example) and sometimes 2NF, but does not change the shape of the variables themselves—that is, a variable still represents a single thing and exists per observation.

Related Posts

Vectors for Programmers

John Mount has a couple of videos available: We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material. Click through […]

Read More

Visualizing Earthquake Data

Giorgio Garziano continues a series on analyzing earthquake data: This is the third part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific 30 days time span. In this post, we are going to show static, interactive and animated earthquakes maps of different flavors by […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

May 2019
MTWTFSS
« Apr  
 12345
6789101112
13141516171819
20212223242526
2728293031