Outliers In Histograms

Edwin Thoen has an interesting solution to a classic problem with histograms:

Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

hist_data %>%
  mutate(x_new = ifelse(x > 10, 10, x)) %>%
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Edwin then shows a nicer solution, so read the whole thing.
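
For anyone who wants to try the quick-and-dirty idea without the original hist_data, here is a minimal self-contained sketch in base R. The helper name cap_outliers, the synthetic data, and the cutoff of 10 are assumptions made for illustration; this is not Edwin's nicer solution, just the capping trick from the excerpt applied to made-up numbers.

```r
# Hypothetical helper (not from Edwin's post): cap values at a threshold so
# that everything above it lands in the final histogram bin.
cap_outliers <- function(x, cap) {
  pmin(x, cap)
}

set.seed(42)

# Synthetic data (an assumption): mostly small values plus a few large outliers.
x <- c(rexp(1000, rate = 1), runif(20, min = 50, max = 200))

x_capped <- cap_outliers(x, cap = 10)

# Breaks run from 0 to the cap in steps of 0.1, so the last bin collects
# every value that was capped at 10.
h <- hist(
  x_capped,
  breaks = seq(0, 10, by = 0.1),
  col    = "cornflowerblue",
  border = "black",
  main   = "Histogram with a single outlier bin",
  xlab   = "x (values above 10 are capped at 10)"
)
```

The bulk of the distribution keeps its shape, and the spike in the last bin gives a rough count of the outliers. The cutoff has to be chosen by eye for each dataset, which may be part of why the excerpt calls this approach quick and dirty.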


April 2017