Disagreement On Outliers

Antony Unwin uses Overview of Outliers (O3) plots in R to compare how various packages identify outliers:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.
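The "possible 15 combinations" in the quote comes from counting every non-empty subset of the four stackloss variables. As a minimal sketch of that arithmetic only (it does not run HDoutliers or draw an O3 plot), the subsets can be enumerated in Python. The variable names are the column names of R's built-in stackloss data frame. The count of 11 empty combinations is taken from the quoted text, not computed.

```python
from itertools import combinations

# Column names of R's built-in stackloss data frame.
variables = ["Air.Flow", "Water.Temp", "Acid.Conc.", "stack.loss"]

# Every non-empty subset of the variables is one candidate row in an O3 plot.
subsets = [
    combo
    for size in range(1, len(variables) + 1)
    for combo in combinations(variables, size)
]

# 2**4 - 1 = 15 candidate combinations.
print(len(subsets))

# The quoted run reports no outliers for 11 of them, so only the rest get rows.
combinations_without_outliers = 11  # figure taken from the quoted text
rows_shown = len(subsets) - combinations_without_outliers
print(rows_shown)
```

So the default plot described above has four rows, the last of which covers all four variables together.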

Interesting reading.


