Disagreement On Outliers

Antony Unwin reviews how various packages track outliers using the Overview of Outliers plot in R:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Interesting reading.

Related Posts

Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package: The above code is assuming you have the wrapr package attached via already having run library('wrapr'). Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t […]

Read More

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way: Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, […]

Read More


March 2018
« Feb Apr »