Simpson’s Paradox Explained

Mehdi Daoudi, et al, have a nice explanation of Simpson’s Paradox:

E.H. Simpson first described the phenomenon of Simpson’s paradox in 1951. The actual name “Simpson’s paradox” was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G.W. Haggstrom pointed out that Simpson’s paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.

The authors also provide a helpful example with operational metrics, showing how aggregating the data leads to an opposite (and invalid) conclusion.

Related Posts

Exploratory Data Analysis with inspectdf

Laura Ellis continues a dive into Exploratory Data Analysis, this time using the inspectdf package: I like this package because it’s got a lot of functionality and it’s incredibly straightforward to use. In short, it allows you to understand and visualize column types, sizes, values, value imbalance & distributions as well as correlations. Better yet, […]

Read More

Non-Linear Classifiers with Support Vector Machines

Rahul Khanna continues a series on support vector machines: In this blog post, we will look at a detailed explanation of how to use SVM for complex decision boundaries and build Non-Linear Classifiers using SVM. The primary method for doing this is by using Kernels. In linear SVM we find margin maximizing hyperplane with features […]

Read More


August 2017
« Jul Sep »