Press "Enter" to skip to content

Unintentional Data

Eric Hollingsworth describes data science as the cost of collecting data approaches zero:

Thankfully not only have modern data analysis tools made data collection cheap and easy, they have made the process of exploratory data analysis cheaper and easier as well. Yet when we use these tools to explore data and look for anomalies or interesting features, we are implicitly formulating and testing hypotheses after we have observed the outcomes. The ease with which we are now able to collect and explore data makes it very difficult to put into practice even basic concepts of data analysis that we have learned — things such as:

  • Correlation does not imply causation.
  • When we segment our data into subpopulations by characteristics of interest, members are not randomly assigned (rather, they are chosen deliberately) and suffer from selection bias.
  • We must correct for multiple hypothesis tests.
  • We ought not dredge our data.

All of those principles are well known to statisticians, and have been so for many decades. What is newer is just how cheap it is to posit hypotheses. For better and for worse, technology has led to a democratization of data within organizations. More people than ever are using statistical analysis packages and dashboards, explicitly or more often implicitly, to develop and test hypotheses.

This is a thoughtful essay well worth reading.