An Example of p-Hacking

Vincent Granville explains why using p-values for model-worthiness can lead you to a bad outcome:

Recently, p-values have been criticized and even banned by some journals, because they are used by researchers, who cherry-pick observations and repeat experiments until they obtain a p-value worth publishing to obtain grant money, get tenure, or for political reasons.  Even the American Statistical Association wrote a long article about why to avoid p-values, and what you should do instead: see here.  For data scientists, obvious alternatives include re-sampling techniques: see here and here. One advantage is that they are model-independent, data-driven, and easy to understand. 

Here we explain how the manipulation and treachery works, using a simple simulated data set consisting of purely random, non-correlated observations. Using p-values, you can tell anything you want about the data, even the fact that the features are highly correlated, when they are not. The data set consists of 16 variables and 30 observations, generated using the RAND function in Excel. You can download the spreadsheet here.

And for a more academic treatment of the problem, I love this paper by Andrew Gelman and Eric Loken, particularly because it points out that you don’t have to have malicious intent to end up doing the wrong thing.

Related Posts

Linear Regression Assumptions

Stephanie Glen has a chart which explains the four key assumptions behind when Ordinary Least Squares is the Best Linear Unbiased Estimator: If any of the main assumptions of linear regression are violated, any results or forecasts that you glean from your data will be extremely biased, inefficient or misleading. Navigating all of the different assumptions […]

Read More

Monte Carlo Simulation in Python

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions: One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

June 2019
MTWTFSS
« May  
 12
3456789
10111213141516
17181920212223
24252627282930