John Mount introduces vtreat, an R package for data preparation:

Our group is distributing a detailed write-up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.

Read more about vtreat on the package page or the vtreat vignette.
