Category: R

Validating Errors in A/B Testing

Roland Stevenson shows us how to validate Type I and Type II errors when performing A/B tests in R:

In this post, we seek to develop an intuitive sense of what type I (false-positive) and type II (false-negative) errors represent when comparing metrics in A/B tests, in order to gain an appreciation for “peeking”, one of the major problems plaguing the analysis of A/B tests today.

To better understand what “peeking” is, it helps to first understand how to properly run a test. We will focus on the case of testing whether there is a difference between the conversion rates cr_a and cr_b for groups A and B. We define conversion rate as the total number of conversions in a group divided by the total number of subjects. The basic idea is that we create two experiences, A and B, and give half of the randomly-selected subjects experience A and half B. Then, after some number of users have gone through our test, we measure how many conversions happened in each group. The important question is: how many users do we need to have in groups A and B in order to measure a difference in conversion rates of a particular size?

Read the whole thing. H/T R-bloggers
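
To make the closing question in that excerpt concrete, here is a minimal Python sketch of the textbook sample-size calculation for comparing two proportions. It is not taken from the linked post: the function name `sample_size_per_group` and the defaults (a two-sided test at alpha = 0.05 with 80% power) are my own assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(cr_a, cr_b, alpha=0.05, power=0.80):
    """Approximate subjects needed in EACH group to detect cr_a vs cr_b.

    Uses the normal-approximation formula for a two-sided test of two
    proportions with equal group sizes. alpha is the type I error rate;
    1 - power is the type II error rate.
    """
    if cr_a == cr_b:
        raise ValueError("conversion rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    pooled = (cr_a + cr_b) / 2                     # pooled rate under the null
    null_sd = math.sqrt(2 * pooled * (1 - pooled))
    alt_sd = math.sqrt(cr_a * (1 - cr_a) + cr_b * (1 - cr_b))
    n = (z_alpha * null_sd + z_beta * alt_sd) ** 2 / (cr_a - cr_b) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical rates: 10% baseline versus 12% for the variant.
    print(sample_size_per_group(0.10, 0.12))
```

The number this returns is the group size you commit to before the test starts. Repeatedly checking for significance before reaching it is the "peeking" the post warns about.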

Microsoft ML Server 9.4

Jeroen Ter Heerdt announces Microsoft Machine Learning Server 9.4:

Today we’re excited to announce our latest Microsoft Machine Learning Server 9.4 release, which addresses popular customer requests as well as developments in the R and Python community.

Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features and algorithmic innovation that brings the best of open source and proprietary worlds together.

This is the best way to bind new versions of R and Python to your SQL Server ML Services installation.

Nowcasting Unemployment

Peter Ellis takes us through an attempt to perform near-term projection of Australian unemployment rates based on macroeconomic indicators:

“Leading” in this case will have to mean pretty fast, because the official unemployment stats in Australia come out from the Australian Bureau of Statistics (ABS) with admirable promptitude given the complexity of managing the Labour Force Survey. ABS Series 6202.0 – the monthly summary from the Labour Force Survey – comes out around two weeks after the reference month. Only a few economic variables of interest are available faster than that. In this blog post I look at two candidates for leading information that are readily available in more or less real time – interest rates and stock exchange prices.

One big change in the past decade in this sort of short-term forecasting of unemployment has been to model the transitions between participation, employed and unemployed people, rather than direct modelling of the resulting proportions. This innovation comes from an interesting 2012 paper by Barnichon and Nekarda. I’ve only skimmed this paper, but I’d like to look into how much of the gains they report comes from the focus on workforce transitions, and how much from their inclusion of new information in the form of vacancy postings and claims for unemployment insurance. My suspicion is that these latter two series have powerful new information. I will certainly be returning to vacancy information and job adverts at a later time – these are items which feature prominently for me in my day job at Nous Group in analysing the labour market.

This gets a little deep but it’s well worth the read. H/T R-bloggers

An Intro to k-Means Clustering

Holger von Jouanne-Diedrich takes us through an example of how k-means clustering works:

The guiding principles are:

– The distance between data points within clusters should be as small as possible.
– The distance of the centroids (= centres of the clusters) should be as big as possible.

Because there are too many possible combinations of all possible clusters comprising all possible data points, k-means follows an iterative approach.

Click through for a demonstration. I appreciate the visualizations of the intermediate steps as well, because they give you an intuitive understanding of what the one-liner function is really doing.
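
For a sense of what that iterative approach looks like, here is a minimal Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not the code from the linked post: the function name `kmeans`, its parameters, and the choice to seed the centroids by sampling data points are assumptions for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid closest to points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    centroids, labels = kmeans(data, k=2)
    print(centroids)
    print(labels)
```

Each pass shrinks the distances within clusters, which is the first guiding principle in the excerpt. The result can depend on the starting centroids, so practical implementations usually try several random starts.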

Polishing Uncalibrated Models

Nina Zumel takes an uncalibrated random forest model and applies a calibration technique to improve the estimate on one variable:

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This “polishing step” potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.

This is a great explanation of the process as well as its risks and limitations.

Generating Excel Spreadsheets from Shiny

Richard Hill and Andy Merlino show how you can export data from a Shiny app into Excel:

R is great for report generation. Shiny allows us to easily create web apps that generate a variety of reports with R.

This post details a demo Shiny app that generates an Excel report, a PowerPoint report, and a PDF report.

The full Shiny app source code is available here. Also, we included a more basic Shiny app that generates an Excel report at the end of this post. Follow-up posts will include similar simple Shiny apps generating PowerPoint and PDF reports.

Excel is still the most popular business intelligence tool and Excel support tends to be one of the first requests people get with third-party apps, so it’s good to know you can do this in Shiny without too much rigmarole.

xgboost and Small Numbers of Subtrees

John Mount covers an interesting issue you can run into when using xgboost:

While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).
In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!

It’s not something you’ll hit very often, but if you’re trying xgboost against a small enough data set with few enough rounds, it is something to keep in mind.

Reinforcement Learning with R

Holger von Jouanne-Diedrich takes us through concepts in reinforcement learning:

At the core this can be stated as the problem a gambler has who wants to play a one-armed bandit: if there are several machines with different winning probabilities (a so-called multi-armed bandit problem) the question the gambler faces is: which machine to play? He could “exploit” one machine or “explore” different machines. So what is the best strategy given a limited amount of time… and money?

There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, or no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that and compare the results. So in the simplest case the first variable e=0.1 is the probability rate with which to switch to a random bandit – or to stick with the best bandit found so far.

Click through for various cases and a pathfinding example in R. H/T R-bloggers
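
The explore-or-exploit rule described in the excerpt is commonly called an epsilon-greedy strategy. Below is a minimal Python sketch of it, not the post's R code: the function name `epsilon_greedy`, the win probabilities, and the number of pulls are all hypothetical, with `epsilon=0.1` mirroring the e=0.1 mentioned above.

```python
import random


def epsilon_greedy(win_probs, pulls=10_000, epsilon=0.1, seed=42):
    """Simulate an epsilon-greedy gambler on a set of bandits.

    win_probs[i] is the true (hidden) winning probability of bandit i.
    Returns (counts, wins): how often each bandit was played and won.
    """
    rng = random.Random(seed)
    n = len(win_probs)
    counts = [0] * n
    wins = [0] * n

    def estimate(i):
        # Observed win rate so far; 0.0 for a bandit not yet played.
        return wins[i] / counts[i] if counts[i] else 0.0

    for t in range(pulls):
        if t == 0 or rng.random() < epsilon:
            arm = rng.randrange(n)            # explore: pick a random bandit
        else:
            arm = max(range(n), key=estimate)  # exploit: best bandit found so far
        counts[arm] += 1
        if rng.random() < win_probs[arm]:
            wins[arm] += 1
    return counts, wins


if __name__ == "__main__":
    counts, wins = epsilon_greedy([0.1, 0.3, 0.8])
    print(counts)
    print(wins)
```

Setting `epsilon` to 0 or 1 reproduces the two extremes from the excerpt: after the first random pick, 0 never explores again, while 1 plays every bandit at random and never exploits what it has learned.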

Biases in Tree-Based Models

Nina Zumel looks at tree-based ensemble models like random forest and gradient boosting and shows that they can be biased:

In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.

However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.

Read on for an example.

R 3.6.1 Available

David Smith notes a new version of R is available:

On July 5, the R Core Group released the source code for the latest update to R, R 3.6.1, and binaries are now available to download for Windows, Linux and Mac from your local CRAN mirror.

R 3.6.1 is a minor update to R that fixes a few bugs. As usual with a minor release, this version is backwards-compatible with R 3.6.0 and remains compatible with your installed packages. 

Click through for the changes. There is one nice addition around writeClipboard, but otherwise this is a release you update to if you’re bothered by a bug it fixes and skip if you’re not.
