Exploring The MNIST Dataset

David Robinson performs exploratory data analysis on the MNIST digit database:

The challenge is to classify a handwritten digit based on a 28-by-28 black and white image. MNIST is often credited as one of the first datasets to prove the effectiveness of neural networks.

In a series of posts, I’ll be training classifiers to recognize digits from images, while using data exploration and visualization to build our intuitions about why each method works or doesn’t. Like most of my posts I’ll be analyzing the data through tidy principles, particularly using the dplyr, tidyr and ggplot2 packages. In this first post we’ll focus on exploratory data analysis, to show how you can better understand your data before you start training classification algorithms or measuring accuracy. This will help when we’re choosing a model or transforming our features.

Read on for the analysis.

Related Posts

Naive Bayes Against Large Data Sets

Catherine Bernadorne walks us through using Naive Bayes for sentiment analysis: The more data that is used to train the classifier, the more accurate it will become over time. So if we continue to train it with actual results in 2017, then what it predicts in 2018 will be more accurate. Also, when Bayes gives […]

Read More

Disambiguating The Confusion Matrix

John Cook walks through a set of valuable terms derived from the core components of the confusion matrix: How many terms are possible? There are four basic ingredients: TP, FP, TN, and FN. So if each term may or may not be included in a sum in the numerator and denominator, that’s 16 possible numerators […]

Read More


January 2018
« Dec Feb »