John Mount introduces vtreat, an R package for data preparation:

Our group is distributing a detailed write up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.

Read more about vtreat on the package page or the vtreat vignette.

Related Posts

Using cdata To Created Faceted Plots

Nina Zumel shows how to use the cdata package to create faceted ggplot2 plots: First, load the packages and data: library("ggplot2") library("cdata") iris <- data.frame(iris) Now define the data-shaping transform, or control table. The control table is basically a picture that sketches out the final data shape that I want. I want to specify the x and y columns of the plot […]

Read More

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More


December 2016
« Nov Jan »