Part of the vtreat philosophy is to assume that after the vtreat variable processing, the next step is a sophisticated supervised machine learning method. Under this assumption, the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, there is an important exception: variable screening. In practice we have seen wide data warehouses with hundreds of columns overwhelm and defeat state-of-the-art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you cannot treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.
Read on to see a couple quick functions which help with this screening.
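The screening idea can be sketched with vtreat's own tooling: designTreatmentsN() produces a scoreFrame with a per-variable significance estimate (the sig column), which can be filtered before modeling. The data frame, variable names, and cutoff below are illustrative assumptions, not the post's example.

```r
# A minimal sketch of significance-based variable screening with vtreat.
# The data (d), variables, and cutoff are hypothetical stand-ins.
library(vtreat)

set.seed(2018)
d <- data.frame(x1 = rnorm(100),
                x2 = sample(letters[1:5], 100, replace = TRUE))
d$y <- 0.5 * d$x1 + rnorm(100)   # x2 is pure noise

plan <- designTreatmentsN(d, varlist = c("x1", "x2"), outcomename = "y")

# Keep only derived variables whose significance beats a chosen cutoff.
cutoff    <- 0.05
good_vars <- plan$scoreFrame$varName[plan$scoreFrame$sig < cutoff]
treated   <- prepare(plan, d, varRestriction = good_vars)
```

The cutoff is a judgment call; the point is simply that noise columns get filtered out before they reach the downstream learner.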
The Small World of Words project focuses on word associations. You can try it out for yourself to see how it works, but the general idea is that the participant is presented with a word (from “telephone” to “journalist” to “yoga”) and is then asked to give their immediate association with that word. The project has collected more than 15 million responses to date, and is still collecting data. You can check out some pre-built visualizations the researchers have put together to explore the dataset, or you can download the data for yourself.
It’s an interesting analysis of the data set, mixed in with some good R code.
Plotting univariate (sampled) normal data
Well, that’s obvious.
d %>% ggplot(aes(x = X1)) + geom_density()
It gets much less obvious from there. It was also interesting learning about ggplotly, a function which translates ggplot2 visuals to plotly visuals.
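For context, the hand-off to plotly is a one-liner: wrap an existing ggplot2 object with ggplotly(). The data frame d and column X1 here are stand-ins matching the snippet above.

```r
# Sketch: turning a ggplot2 density plot into an interactive plotly widget.
library(ggplot2)
library(plotly)

d <- data.frame(X1 = rnorm(1000))  # placeholder for the post's data

p <- ggplot(d, aes(x = X1)) + geom_density()
ggplotly(p)  # same plot, rendered as an interactive plotly widget
```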
If you run this code, you should see a lot of output indicating that R is downloading, compiling and installing randomForest, and finally that the image is being pushed to Azure. (You will see this output even if your machine already has the randomForest package installed. This is because the package is being installed to the R session inside the container, which is distinct from the one running the code shown here.)
All docker calls in AzureContainers, like the one to build the image, return the actual docker command line as the cmdline attribute of the (invisible) returned value. In this case, the command line is "docker build -t bos_rf .". Similarly, the push() method actually involves two Docker calls, one to retag the image and the second to do the actual pushing; the returned value in this case will be a two-component list with the command lines being "docker tag bos_rf deployreg.azurecr.io/bos_rf" and "docker push deployreg.azurecr.io/bos_rf".
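A short sketch of inspecting those command lines, assuming the call_docker() wrapper the workflow is built on; depending on the package version, the command line may be exposed as an R attribute or a list element, so both are shown.

```r
# Hedged sketch: building the image and recovering the docker command line
# that AzureContainers reports. Image name follows the post.
library(AzureContainers)

res <- call_docker("build -t bos_rf .")  # wraps the docker CLI

attr(res, "cmdline")  # per the post: the command line as an attribute
res$cmdline           # in some versions it is a list element instead
```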
I love this confluence of technologies and at the same time get a “descent into madness” feeling from the sheer number of worlds colliding.
Patrick Bajari and Gregory Lewis have collected a detailed sample of 466 road construction projects in Minnesota to study this question in their very interesting article Moral Hazard, Incentive Contracts and Risk: Evidence from Procurement in the Review of Economic Studies, 2014.
They estimate a structural econometric model and find that changes in contract design could substantially reduce the duration of road blockages and largely increase total welfare at only minor increases in the risk that road construction firms face.
As part of his master's thesis at Ulm University, Claudius Schmid has generated a nice and detailed RTutor problem set that allows you to replicate the findings in an interactive fashion. You learn a lot about the structure and outcomes of the currently used contracts, the theory behind better contract design, and how the structural model used to assess the quantitative effects can be estimated and simulated. At the same time, you can hone your general data science and R skills.
Click through to a couple of ways to get to this RTutor project and learn a bit about building incentive contracts to modify behavior. H/T R-Bloggers
This is code that accompanies a book chapter on customer churn that I have written for the German dpunkt Verlag. The book is in German and will probably appear in February: https://www.dpunkt.de/buecher/13208/9783864906107-data-science.html.
The code you find below can be used to recreate all figures and analyses from this book chapter. Because the content is exclusively for the book, my descriptions around the code had to be minimal. But I'm sure you can get the gist, even without the book. 😉
Click through for the code. This is using the venerable AT&T customer churn data set.
I benefit from the work of the R Foundation. They oversee the language, but also encourage a healthy ecosystem. CRAN happens because of them. Updates to R happen because of them. useR! happens because of them. Every day, you and I are the recipients of some part of their time.
The least we can do is show them some appreciation. If you point your web browser at https://www.r-project.org/foundation/donations.html you’ll find a convenient (and surprisingly inexpensive) place to express your appreciation. As an individual, you can send these kind folks twenty-five euros to tell them you’re in favor of what they do.
But be sure to read the whole thing, especially if you are an American who wants the donation to be tax-deductible. I believe that earmarking, in this case, means adding special instructions on SIAA's PayPal page.
In this reproduction attempt we see:

- dplyr time being around 0.05 seconds. This is about 5 times slower than claimed.
- sum()/n() time being about 0.2 seconds, about 5 times faster than claimed.
- data.table time being around 0.004 seconds. This is about three times as fast as the dplyr claims, and over ten times as fast as the actual observed dplyr time.
Read the whole thing. If you want to replicate it yourself, check out the RMarkdown file.
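If you want a feel for how such a comparison is set up, a grouped-sum timing harness looks roughly like this. The data size, grouping column, and the times shown by microbenchmark are illustrative, not the original benchmark.

```r
# Rough sketch of a grouped-sum timing comparison across approaches.
library(microbenchmark)
library(dplyr)
library(data.table)

n  <- 1e6
df <- data.frame(g = sample(1:100, n, replace = TRUE), x = runif(n))
dt <- as.data.table(df)

microbenchmark(
  dplyr      = df %>% group_by(g) %>% summarize(s = sum(x)),
  base_sum   = tapply(df$x, df$g, sum),
  data.table = dt[, .(s = sum(x)), by = g],
  times = 10
)
```

Timings like these are sensitive to data size, number of groups, and package versions, which is exactly why reproduction attempts diverge from published claims.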
For those of you who have been following along with issue #51 in the ggmap repo, you’ll notice that there have been a number of changes in the Google Maps Static API service. Unfortunately these have caused some breakage in previous ggmap functionality.
If you used this package prior to July 2018, you were likely able to do so without signing up for the Google Static Maps API service yourself. As indicated on the ggmap GitHub repo, "Google has recently changed its API requirements, and ggmap users are now required to provide an API key and enable billing." The billing enablement especially is a bit of a downer, but you can use the free tier without incurring charges. Also, the service being exposed through an easy-to-use R package that extends ggplot2 is pretty great, so I'll allow it.
This recent API change hurts. But click through for the tutorial, which doesn’t hurt.
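The practical consequence is one extra setup step: supplying your key via ggmap's register_google() before requesting map tiles. The key string, coordinates, and zoom below are placeholders.

```r
# Sketch: registering a Google API key with ggmap after the 2018 change.
library(ggmap)

register_google(key = "YOUR-API-KEY-HERE")  # placeholder key

map <- get_googlemap(center = c(lon = -95.36, lat = 29.76), zoom = 10)
ggmap(map)
```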
To read more about getting started with covrpage in your own package in a few lines of code only, we recommend checking out the "get started" vignette. It explains more about how to set up the Travis deploy, mentions which functions power the covrpage report, and gives more motivation for using it. And to learn how the information provided by covrpage should be read, read the "How to read the covrpage report" vignette.
Check it out.
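For orientation, the local workflow is small: run covrpage from the package root and it writes a tests/README.md summarizing test and coverage results. The entry-point function name covrpage() and the path are assumptions here; the vignettes above are the authoritative reference.

```r
# Hedged sketch: generating a covrpage coverage summary for a package.
# covrpage() is assumed to be the package's entry point, run from the
# package root; the path is a placeholder.
library(covrpage)

setwd("/path/to/your/package")  # package root (placeholder)
covrpage()                      # writes tests/README.md from test + covr info
```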