Tidy Data Is Normalized Data

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure:  we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.

Related Posts

Scatterplots For Multivariate Analysis

Neil Saunders declutters a complicated visual with a simple scatterplot: Sydney’s congestion at ‘tipping point’ blares the headline and to illustrate, an interactive chart with bars for city population densities, points for commute times and of course, dual-axes. Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, […]

Read More

Using ggpairs To Find Correlations Between Variables In R

Akshay Mahale shows how to use the ggpairs function in R to see the correlation between different pairs of variables: From the above matrix for iris we can deduce the following insights: Correlation between Sepal.Length and Petal.Length is strong and dense. Sepal.Length and Sepal.Width seems to show very little correlation as datapoints are spreaded through out the plot area. Petal.Length and Petal.Width also shows strong correlation. Note: The […]

Read More

Categories

February 2018
MTWTFSS
« Jan Mar »
 1234
567891011
12131415161718
19202122232425
262728