John Mount explains the vtreat package that he and Nina Zumel have put together:

When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

That’s immediately going onto my learn-more list.

Related Posts

Using ggpairs To Find Correlations Between Variables In R

Akshay Mahale shows how to use the ggpairs function in R to see the correlation between different pairs of variables: From the above matrix for iris we can deduce the following insights: Correlation between Sepal.Length and Petal.Length is strong and dense. Sepal.Length and Sepal.Width seems to show very little correlation as datapoints are spreaded through out the plot area. Petal.Length and Petal.Width also shows strong correlation. Note: The […]

Read More

Testing Spatial Equilibrium Concepts With tidycensus

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests using data from the tidycensus package: Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy […]

Read More


March 2018
« Feb Apr »