Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume after the vtreat variable processing the next step is a sophisticated supervised machine learningmethod. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple quick functions which help with this screening.

Related Posts

Linear Regression Assumptions

Stephanie Glen has a chart which explains the four key assumptions behind when Ordinary Least Squares is the Best Linear Unbiased Estimator: If any of the main assumptions of linear regression are violated, any results or forecasts that you glean from your data will be extremely biased, inefficient or misleading. Navigating all of the different assumptions […]

Read More

Visualizing with Heatmaps in R

Anisa Dhana shows how you can create a quick heatmap plot in R: To give your own colors use the scale_fill_gradientn function.ggplot(dat, aes(Age, Race)) + geom_raster(aes(fill = BMI)) + scale_fill_gradientn(colours=c("white", "red")) This is a quick example using ggplot2 but there are other heatmap libraries available too.

Read More


December 2018
« Nov Jan »