Simpson’s Paradox Explained

Mehdi Daoudi, et al., have a nice explanation of Simpson’s Paradox:

E.H. Simpson first described the phenomenon of Simpson’s paradox in 1951. The actual name “Simpson’s paradox” was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G.W. Haggstrom pointed out that Simpson’s paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.
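To make Haggstrom's description concrete, here is a minimal sketch in Python. The data is synthetic and purely illustrative: the interval width, the within-interval slope of -2, and the step of 5 between intervals are assumptions chosen for the demonstration, not values from the quoted text. It fits a least-squares slope on each short interval and then on the pooled data.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data (assumed for illustration): five intervals [k, k+1).
# Inside each interval y falls with slope -2, but each interval
# starts 5 units higher than the one before it.
segments = []
for k in range(5):
    xs = [k + j / 10 for j in range(10)]
    ys = [-2 * (x - k) + 5 * k for x in xs]
    segments.append((xs, ys))

# Slope within each short interval: negative every time.
within_slopes = [ols_slope(xs, ys) for xs, ys in segments]

# Slope over the whole domain: pool all points and fit once.
all_xs = [x for xs, _ in segments for x in xs]
all_ys = [y for _, ys in segments for y in ys]
overall_slope = ols_slope(all_xs, all_ys)

print("within-interval slopes:", [round(s, 3) for s in within_slopes])
print("overall slope:", round(overall_slope, 3))
```

Every within-interval slope comes out negative while the pooled slope is positive, which is the sign reversal the quote describes.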

The authors also provide a helpful example with operational metrics, showing how aggregating the data leads to an opposite (and invalid) conclusion.
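The same reversal can show up in rate-style operational metrics. The sketch below uses invented counts, not the authors' data: two hypothetical server pools, A and B, with the number of requests served within a target time during off-peak and peak hours. The pool names, segment names, and all counts are assumptions for illustration.

```python
# Hypothetical counts: (requests within target, total requests).
counts = {
    "A": {"off_peak": (90, 100), "peak": (600, 1000)},
    "B": {"off_peak": (850, 1000), "peak": (50, 100)},
}


def segment_rates(segments):
    """Success rate for each traffic segment of one pool."""
    return {name: ok / total for name, (ok, total) in segments.items()}


def overall_rate(segments):
    """Success rate after summing counts across all segments."""
    ok = sum(o for o, _ in segments.values())
    total = sum(t for _, t in segments.values())
    return ok / total


for pool, segments in counts.items():
    per_segment = {k: round(v, 3) for k, v in segment_rates(segments).items()}
    print(pool, per_segment, "overall:", round(overall_rate(segments), 3))
```

With these counts, pool A has the higher rate in both segments, yet pool B looks better once the segments are summed, because most of B's traffic falls in the easier off-peak segment. Comparing the pools on the aggregate alone would point to the wrong one.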



August 2017