Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume after the vtreat variable processing the next step is a sophisticated supervised machine learningmethod. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple quick functions which help with this screening.

Related Posts

Where Machine Learning And Econometrics Collide

Dave Giles shares some thoughts on how machine learning and econometrics relate: What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)? Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d […]

Read More

AzureR Packages In Cran

David Smith points out that the Azure packages for R are now in CRAN: The suite of AzureR packages for interfacing with Azure services from R is now available on CRAN. If you missed the earlier announcements, this means you can now use the install.packages function in R to install these packages, rather than having to install from the […]

Read More


December 2018
« Nov Jan »