John Mount introduces vtreat, an R package for data preparation:

Our group is distributing a detailed write-up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Alternatively, below is the tl;dr (“too long; didn’t read”) form.

Read more about vtreat on the package page or the vtreat vignette.

December 2016