The Use And Misuse Of P Values

John Mount and Nina Zumel explain what p-values are and how people routinely misuse them:

The many things I happen to have issues with in common mis-use of p-values include:

  1. p-hacking. This includes censored data bias, repeated measurement bias, and even outright fraud.

  2. “Statsmanship” (the deliberate use of statistical terminology for obscurity, not for clarity). For example: saying p instead of saying what you are testing such as “significance of a null hypothesis”.

  3. Logical fallacies. This is the (false) claim that p being low implies that the probability that your model is good is high. At best a low-p eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you find the cook innocent (a simple case of a fallacy of an excluded middle).

  4. Confusion of population and individual statistics. This is the use of deviation of sample means (which typically decreases as sample size goes up) when deviation of individual differences (which typically does not decrease as sample size goes up) is what is appropriate. This is one of the biggest scams in data science and marketing science: showing that you are good at predicting an aggregate (say, the mean number of traffic deaths in the next week in a large city) and claiming this means your model is good at predicting per-individual risk. Some of this comes from the usual statistical word games: saying “standard error” (instead of “standard error of the mean or population”) and “standard deviation” (instead of “standard deviation of individual cases”); with some luck somebody won’t remember which is which and be too afraid to ask.
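
To make item 4 concrete, here is a minimal sketch (not from the original article) comparing the two quantities on simulated data. The normal distribution with mean 100 and standard deviation 15, and the function name spread_summary, are assumptions chosen purely for illustration.

```python
import random
import statistics


def spread_summary(n, seed=0):
    """Return (standard deviation of individuals, standard error of the mean)
    for n simulated draws. The N(100, 15) population is an illustrative
    assumption, not data from the article."""
    rng = random.Random(seed)
    sample = [rng.gauss(100.0, 15.0) for _ in range(n)]
    sd = statistics.stdev(sample)   # spread of individual cases
    se = sd / n ** 0.5              # uncertainty in the estimated mean
    return sd, se


for n in (100, 10_000):
    sd, se = spread_summary(n)
    print(f"n={n:>6}  sd of individuals={sd:6.2f}  se of mean={se:6.3f}")
```

With more data the standard error of the mean shrinks roughly as one over the square root of n, while the standard deviation of individuals stays near the population value. A tight estimate of the aggregate therefore says little about how well any one individual can be predicted.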

Even if you know what p-values are, this is definitely worth reading, as it’s so easy to misuse p-values (even when I’m not on my Bayesian post hurling tomatoes at frequentists).
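
As a rough illustration of how easy it is, the sketch below simulates the repeated measurement bias named in item 1: checking a p-value after every batch of data and stopping at the first "significant" result. It is not taken from the article. The z-test with known unit variance, the batch size, the number of looks, and the function names are all assumptions made for the example, and the null hypothesis is true in every simulated experiment.

```python
import math
import random


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_experiments=2000, looks=10, batch=20,
                         alpha=0.05, seed=1):
    """Simulate experiments where H0 is true and compare two analysts:
    one tests once at the end, one tests after every batch and reports
    the first p-value below alpha."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_experiments):
        data = []
        p_values = []
        for _ in range(looks):
            data.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
            p_values.append(z_test_p_value(data))
        fixed_hits += p_values[-1] < alpha    # single planned test
        peeking_hits += min(p_values) < alpha  # stop at first "win"
    return fixed_hits / n_experiments, peeking_hits / n_experiments


fixed, peeking = false_positive_rates()
print(f"false positive rate, one test at the end: {fixed:.3f}")
print(f"false positive rate, peeking every batch: {peeking:.3f}")
```

The single planned test should reject at close to the nominal 5% rate, while the peeking analyst rejects noticeably more often even though there is no effect to find.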
