Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume after the vtreat variable processing the next step is a sophisticated supervised machine learningmethod. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple quick functions which help with this screening.

Related Posts

Interactive ggplot Plots with plotly

Laura Ellis takes us through ggplotly: As someone very interested in storytelling, ggplot2 is easily my data visualization tool of choice. It is like the Swiss army knife for data visualization. One of my favorite features is the ability to pack a graph chock-full of dimensions. This ability is incredibly handy during the data exploration […]

Read More

Goodbye, gather and spread; Hello pivot_long and pivot_wide

John Mount covers a change in tidyr which mimics Mount and Nina Zumel’s pivot_to_rowrecs and unpivot_to_blocks functions in the cdata package: If you want to work in the above way we suggest giving our cdatapackage a try. We named the functions pivot_to_rowrecs and unpivot_to_blocks. The idea was: by emphasizing the record structure one might eventually internalize what the transforms […]

Read More


December 2018
« Nov Jan »