Press "Enter" to skip to content

Category: R

From Excel to R: Three Examples

Abdul Majed Raja has a few examples of things which are easy to do in Excel and how you can do them in R:

Create a difference variable between the current value and the next value
This is also known as lead and lag; especially in a time series dataset, this variable becomes very important in feature engineering. In Excel, this is simply done by creating a new formula field, subtracting the current cell from the next cell (or the previous cell from the current cell), and dragging the cell formula down to the last cell.
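
As a rough sketch (not from the original post), the R side might use dplyr's lead() and lag(); the data frame and column name here are invented for illustration:

```r
# A sketch, not from the original post; df and value are invented examples
library(dplyr)

df <- data.frame(value = c(10, 12, 15, 11, 14))

df %>%
  mutate(
    diff_next = lead(value) - value,  # next value minus current value
    diff_prev = value - lag(value)    # current value minus previous value
  )
```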

These things aren’t as easy to do in R as they are in Excel (it’s hard to get simpler than “push a button” or “click and drag your mouse”), but they are useful techniques to know in R. H/T R-bloggers

Calculating AUC in R

Andrew Treadway shows how you can calculate Area Under the Curve in R:

AUC is an important metric in machine learning for classification. It is often used as a measure of a model’s performance. In effect, AUC is a measure between 0 and 1 of a model’s performance that rank-orders predictions from a model. For a detailed explanation of AUC, see this link.

Since AUC is widely used, being able to get a confidence interval around this metric is valuable to both better demonstrate a model’s performance, as well as to better compare two or more models. For example, if model A has an AUC higher than model B, but the 95% confidence interval around each AUC value overlaps, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R’s pROC package, which uses bootstrapping to calculate the interval.
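
A minimal sketch of that with pROC, using made-up labels and scores:

```r
# A sketch with the pROC package; actuals and scores are invented
library(pROC)

actuals <- c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0)
scores  <- c(0.2, 0.3, 0.8, 0.7, 0.4, 0.9, 0.6, 0.1, 0.75, 0.35)

roc_obj <- roc(actuals, scores)
auc(roc_obj)                           # point estimate of AUC
ci.auc(roc_obj, method = "bootstrap")  # bootstrapped 95% confidence interval
```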

There are plenty of ways to calculate this useful metric, but this is definitely one of the easier methods. H/T R-bloggers

Python versus R (Again)

Alex Woodie looks at whether Python is dominating R in the data science space:

There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged by more than 3% to achieve a 10% rating (TIOBE’s proprietary metric that primarily measures search activity), easily the biggest gain among the 20 most popular languages.

R, by contrast, has not fared well lately on the TIOBE Index, where it dropped from 8th place in January 2018 to become the 20th most popular language today, behind Perl, Swift, and Go. At its peak in January 2018, R had a popularity rating of about 2.6%. But today it’s down to 0.8%, according to the TIOBE index.

I’ll say that rumors of R’s demise are premature.

Local Randomness and R

Evgeni Chasnovski has a problem around generating random data:

Let’s say we have a deterministic (non-random) problem for which one of the solutions involves randomness. One very common example of such a problem is function minimization on a certain interval: it can be solved non-randomly (like in most methods of optim()), or randomly (the simplest approach being to generate a random set of points on the interval and to choose the one with the lowest function value).

What is a “clean” way of writing a function to solve the problem? The issue with direct usage of randomness inside a function is that it affects the state of outer random number generation:

Click through for a solution which uses random numbers but doesn’t change the outside world’s random number generation after it’s done.
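
The general idea looks something like this (a sketch, not necessarily Evgeni's exact implementation): save the global RNG state on entry and restore it on exit.

```r
# A sketch of the idea, not Evgeni's exact solution; the function name,
# interval, and sample size are invented for illustration
random_minimize <- function(f, lower, upper, n = 1000) {
  # Save the caller's RNG state (if one exists) and restore it on exit
  if (exists(".Random.seed", envir = globalenv())) {
    old_seed <- get(".Random.seed", envir = globalenv())
    on.exit(assign(".Random.seed", old_seed, envir = globalenv()))
  }
  x <- runif(n, lower, upper)  # random candidate points on the interval
  x[which.min(f(x))]           # point with the lowest function value
}

random_minimize(function(x) (x - 1)^2, 0, 2)  # should be close to 1
```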

ggforce Updates

Thomas Lin Pedersen has some ggforce updates for us:

Now, the above plot has some obvious shortcomings. The diagonal is pretty useless for starters, and these panels are often used to plot the distributions of the individual variables. Using e.g. geom_density() won’t work, as it always starts at 0, thus messing with the y-scale of each row. ggforce provides two new geoms tailored for the diagonal: geom_autodensity() and geom_autohistogram(), which automatically position themselves inside the panel without affecting the y-scale. We’d still need to have this geom only in the diagonal, but facet_matrix() provides exactly this sort of control.
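
Putting those pieces together, a sketch might look like this (assuming ggforce 0.3.0 or later; the dataset and variables are just an example):

```r
# A sketch assuming ggforce >= 0.3.0; layer.diag = 2 restricts the second
# layer (geom_autodensity) to the diagonal panels of the matrix
library(ggplot2)
library(ggforce)

ggplot(iris, aes(x = .panel_x, y = .panel_y)) +
  geom_point(alpha = 0.5) +
  geom_autodensity() +
  facet_matrix(vars(Sepal.Length, Sepal.Width, Petal.Length),
               layer.diag = 2)
```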

There are some interesting improvements in here.

Options with stats::density() in R

Evgeni Chasnovski takes us through what the parameters in the stats::density() R function do:

The argument bw is responsible for computing the bandwidth of the kernel density estimation: one of the main parameters that greatly affects the output. It can be specified either as an algorithm of computation or directly as a number. Because the actual bandwidth is computed as adjust*bw (adjust is another density() argument, which is explored in the next section), here we will see how different algorithms compute bandwidths; the effect of changing the numeric value of the bandwidth will be shown in the section about adjust.

There are 5 available algorithms: “nrd0”, “nrd”, “ucv”, “bcv”, “SJ”. 
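
As a quick sketch of comparing two of those algorithms on invented bimodal data:

```r
# A sketch on invented data: compare the bandwidths two algorithms choose
set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))

d_nrd0 <- density(x, bw = "nrd0")  # Silverman's rule of thumb (the default)
d_sj   <- density(x, bw = "SJ")    # Sheather-Jones method

d_nrd0$bw  # bandwidth chosen by each algorithm
d_sj$bw
```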

Evgeni has also created animations for each of these, so it’s easy to see what they do compared to the default output.

Validating Errors in A/B Testing

Roland Stevenson shows us how to validate Type I and Type II errors when performing A/B tests in R:

In this post, we seek to develop an intuitive sense of what type I (false-positive) and type II (false-negative) errors represent when comparing metrics in A/B tests, in order to gain an appreciation for “peeking”, one of the major problems plaguing the analysis of A/B tests today.

To better understand what “peeking” is, it helps to first understand how to properly run a test. We will focus on the case of testing whether there is a difference between the conversion rates cr_a and cr_b for groups A and B. We define conversion rate as the total number of conversions in a group divided by the total number of subjects. The basic idea is that we create two experiences, A and B, and give half of the randomly selected subjects experience A and half experience B. Then, after some number of users have gone through our test, we measure how many conversions happened in each group. The important question is: how many users do we need to have in groups A and B in order to measure a difference in conversion rates of a particular size?
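
That last question is a power calculation, and base R's power.prop.test() answers it directly; the conversion rates below are invented for illustration:

```r
# A sketch with invented numbers: sample size needed per group to detect
# a lift from a 10% to a 12% conversion rate
power.prop.test(
  p1 = 0.10,         # conversion rate for group A
  p2 = 0.12,         # conversion rate we want to detect for group B
  sig.level = 0.05,  # type I (false-positive) error rate
  power = 0.80       # 1 - type II (false-negative) error rate
)
```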

Read the whole thing. H/T R-bloggers

Microsoft ML Server 9.4

Jeroen Ter Heerdt announces Microsoft Machine Learning Server 9.4:

Today we’re excited to announce our latest Microsoft Machine Learning Server 9.4 release, which addresses popular customer requests as well as developments in the R and Python community.

Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of the open source and proprietary worlds together.

This is the best way to bind new versions of R and Python to your SQL Server ML Services installation.

Nowcasting Unemployment

Peter Ellis takes us through an attempt to perform near-term projection of Australian unemployment rates based on macroeconomic indicators:

“Leading” in this case will have to mean pretty fast, because the official unemployment stats in Australia come out from the Australian Bureau of Statistics (ABS) with admirable promptitude given the complexity of managing the Labour Force Survey. ABS Series 6202.0 – the monthly summary from the Labour Force Survey – comes out around two weeks after the reference month. Only a few economic variables of interest are available faster than that. In this blog post I look at two candidates for leading information that are readily available in more or less real time – interest rates and stock exchange prices.

One big change in the past decade in this sort of short-term forecasting of unemployment has been to model the transitions between participation, employed and unemployed people, rather than direct modelling of the resulting proportions. This innovation comes from an interesting 2012 paper by Barnichon and Nekarda. I’ve only skimmed this paper, but I’d like to look into how much of the gains they report comes from the focus on workforce transitions, and how much from their inclusion of new information in the form of vacancy postings and claims for unemployment insurance. My suspicion is that these latter two series have powerful new information. I will certainly be returning to vacancy information and job adverts at a later time – these are items which feature prominently for me in my day job at Nous Group in analysing the labour market.

This gets a little deep but it’s well worth the read. H/T R-bloggers
