Press "Enter" to skip to content

Performing Linear Regression With Power BI

Jason Cantrell shows how to create a simple linear regression in Power BI:

Linear Regression is a very useful statistical tool that helps us understand the relationship between variables and the effects they have on each other. It can be used across many industries in a variety of ways – from spurring value to gaining customer insight – to benefit business.

The Simple Linear Regression model allows us to summarize and examine relationships between two variables. It uses a single independent variable and a single dependent variable and finds a linear function that predicts the dependent variable's values as a function of the independent variable.

If you want real linear regression, drop in an R or Python script.
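For instance, a minimal R script of the kind you could drop into Power BI might look like this (a sketch: the data below is fabricated, and the name dataset mirrors the data frame Power BI's R script integration supplies):

# Power BI exposes incoming data to an R script as a data frame named dataset;
# we fabricate one here so the sketch runs on its own.
dataset <- data.frame(x = 1:20, y = 2.5 * (1:20) + rnorm(20))

model <- lm(y ~ x, data = dataset)    # fit y as a linear function of x
summary(model)                        # coefficients, R-squared, p-values

dataset$predicted <- predict(model)   # append fitted values as a new column
dataset                               # the resulting data frame flows back to Power BI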

Principal Component Analysis With Faces

Mic at The Beginner Programmer shows us how to create creepy PCA diagrams with human faces:

PCA looks for a new reference system to describe your data. This new reference system is designed in such a way as to maximize the variance of the data across the new axis. The first principal component accounts for as much variance as possible, as does the second and so on. PCA transforms a set of (typically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component will account for as much variance as possible. The hope is that a fewer number of PCs can be used to summarise the whole dataset. Note that PCs are a linear combination of the original data.

The procedure simply boils down to the following steps (a minimal R sketch follows the list):

  1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

  2. Calculate the covariance matrix of the data.

  3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

  4. Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.

  5. Project your data in the new frame of reference by multiplying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.

  6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
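As a rough illustration, those six steps map onto only a few lines of base R (a sketch with synthetic data, not Mic's code):

# Synthetic data: 50 observations of 4 variables
set.seed(42)
X <- matrix(rnorm(200), ncol = 4)

X_scaled <- scale(X)                # step 1: scale/normalize
C <- cov(X_scaled)                  # step 2: covariance matrix
e <- eigen(C)                       # step 3: eigenvalues and eigenvectors ("loadings")
N <- 2                              # step 4: keep the N biggest eigenvalues
loadings <- e$vectors[, 1:N]
scores <- X_scaled %*% loadings     # steps 5-6: project the data; these are the "scores"

# Cross-check against R's built-in PCA (component signs may differ):
head(scores)
head(prcomp(X_scaled)$x[, 1:N])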

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA.  H/T R-bloggers

Sorting With data.table Versus dplyr

John Mount shows us that data.table is way faster for sorting than dplyr's arrange function:

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort, which has the potential to perform in near-linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems (also, we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y-values in the above graph by log(rows), we get something approaching a constant.
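To get a feel for the comparison yourself, a bare-bones timing sketch might look like the following (the sizes and columns are invented; see John's markdown document below for the real benchmark):

library(data.table)
library(dplyr)

n <- 1e6
df <- data.frame(x = runif(n), y = sample(letters, n, replace = TRUE))
dt <- as.data.table(df)

system.time(sorted_df <- arrange(df, x, y))   # dplyr: ordinary ordering
system.time(setorder(dt, x, y))               # data.table: radix sort, in place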

John has also provided us with a markdown document for comparison.

Matrices In R

Dave Mason continues his perusal of R data types, this time looking at the matrix:

All of the examples so far have consisted of matrices with data elements of the same class. And for good reason: it’s a requirement for a matrix. R will coerce elements with mismatched classes to the same class. Here are two vectors, one of class integer and the other of class character. After combining them into a matrix via rbind(), we see the first row of data elements are of the character class (enclosed in double quotes):

> row1 <- c(1L, 2L, 3L, 4L)
> row2 <- c("a", "b", "c", "d")
> new_matrix <- rbind(row1, row2)
> new_matrix
     [,1] [,2] [,3] [,4]
row1 "1"  "2"  "3"  "4" 
row2 "a"  "b"  "c"  "d"

Matrices drive a large number of statistical techniques, though I tend to deal with them less directly than I would have imagined.

Binning And Recoding In R

Sebastian Sauer shows a few methods of practical data reshaping in R:

Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels into one, for instance aggregating the values from “1.00 meter” to “1.60 meter” into “small_size”.

Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.
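As a taste of what the post covers, both tasks take only a line or two in R (the variable names and break points below are invented for illustration):

# Recoding: change levels, e.g. "1" to "woman" and "2" to "man"
sex_code <- c(1, 2, 2, 1)
sex <- dplyr::recode(sex_code, `1` = "woman", `2` = "man")

# Binning: aggregate a range of values into a single level with cut()
height <- c(1.45, 1.58, 1.71, 1.83)
size <- cut(height,
            breaks = c(0, 1.60, 1.80, Inf),
            labels = c("small_size", "medium_size", "large_size"))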

Click through for examples of techniques you can use.

Working With Vectors In R

Dave Mason continues his quest to learn R, focusing on vectors.  First, he looks at vector-based mathematical operations:

Now we can determine the number of customers gained vs number of customers lost (plus/minus) for each month of the quarter by subtracting one vector from another. Each vector has the same number of elements (three), and the result is also a vector of three elements:

> net_customer_gain <- new_customers - customers_lost
> net_customer_gain
Jan Feb Mar 
-15  30   3 

The sum() function can be used to add up all the elements of a vector. Below, we get the total number of new customers and lost customers for the first quarter:

> sum(new_customers)
[1] 270
> sum(customers_lost)
[1] 252
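The excerpt doesn't show how the vectors were created; one pair of definitions consistent with all of the output above (the individual values are inferred for illustration, not taken from Dave's post) would be:

new_customers  <- c(Jan = 75, Feb = 100, Mar = 95)   # sums to 270
customers_lost <- c(Jan = 90, Feb = 70,  Mar = 92)   # sums to 252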

Then he shows off subsetting in vectors:

To extract multiple elements from a vector, pass in an integer class vector to the square brackets. The values of the integer vector correspond to the elements to be extracted. Here we will extract the first, third, and fourth elements of the jersey_numbers vector:

> jersey_numbers[c(1,3,4)]
Pierce  Rondo  Allen 
    34      9     20  

The values of the integer vector can be in any order:

> jersey_numbers[c(4,1,3)]
 Allen Pierce  Rondo 
    20     34      9
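Likewise, jersey_numbers isn't defined in the excerpt; a named numeric vector along these lines reproduces the output (the second element is a placeholder, since it never appears above):

jersey_numbers <- c(Pierce = 34, Unknown = 0, Rondo = 9, Allen = 20)
jersey_numbers[c(1, 3, 4)]
jersey_numbers[c(4, 1, 3)]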

Vectors are a critical part of understanding R.

The Problem With Meta-Packages

John Mount has a critique of meta-packages:

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

I’m not really sold on Jones’s argument, but I do think Mount has a good point.

Calculating Cohort Lifetime Value With Excel And R

Eleni Markou shows how to calculate the lifetime value of a group of customers using two techniques:

A lot of ink has been spilled in developing various descriptions of the LTV, the majority of which end up with mathematical formulas that are based on margin (m), retention rate (r), and discount rate (d), like the following:
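The formula itself didn't survive the excerpt; the commonly cited closed form under these assumptions (the same assumptions the objections below attack) is LTV = m * r / (1 + d - r).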

However, this model appears not to be that realistic, as it is based on a few quite restrictive assumptions:

  • Retention is assumed to be constant during the lifetime of a customer, i.e. the probability r of being retained remains the same across all months.
  • An infinite time horizon is assumed when calculating the present value of future cash flows.
  • The unit economics are assumed to be constant throughout the lifetime, which leads to a constant contribution margin.

Yet when dealing with an actual company, it easily becomes evident that none of the aforementioned conditions actually hold. Especially in early-stage businesses, the time periods across which you would like to calculate the LTV are month- or week-sized, while at the same time the retention rate across them can vary significantly as the company's products evolve quickly.

There’s a lot packed into that article, so give it a read.

Highlighting Data With gghighlight

Laura Ellis shows off the gghighlight package, which allows you to selectively highlight certain sets of data in ggplot:

While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame.  Further, you have to tinker more with the labelling to really call out the highlighted data points.

Thanks to Hiroaki Yutani, we now have the gghighlight package which does most of the work for us with a small function call!!   Please note that a lot of this code was created by looking at examples on her introduction document.

The new school way is even simpler (see the sketch after these steps):

  1. Using ggplot2, create a plot with your full data set.

  2. Add the gghighlight() function to your plot with the conditions set to identify your subset.

  3. Celebrate! This was one less step AND we got labels!
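A minimal sketch of that pattern (the data and the highlight condition here are invented, not Laura's):

library(ggplot2)
library(gghighlight)

df <- data.frame(x     = rep(1:10, times = 3),
                 y     = as.vector(replicate(3, cumsum(rnorm(10)))),
                 group = rep(c("a", "b", "c"), each = 10))

ggplot(df, aes(x, y, colour = group)) +
  geom_line() +              # step 1: plot the full data set
  gghighlight(max(y) > 1)    # step 2: highlight series meeting the condition (labels included)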

That’s a very cool package.  H/T R-Bloggers

Stoppable, Async Shiny Interfaces

Ian at Fells Stats wants to make a long-running Shiny app a bit more user-friendly:

Shiny operates in a reactive programming framework. Fundamentally, this means that any time any UI element that affects the result changes, so does the result. This happens automatically, with your analysis code running every time a widget is changed. In a lot of cases, this is exactly what you want and it makes Shiny programs concise and easy to make; however, in the case of long-running processes, this can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the action button is clicked. Another important component is to provide your user with feedback as to how long the analysis is going to take. Shiny has nice built in progress indicators that allow you to do this.
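The starting pattern he describes looks roughly like this (a stand-in analysis, not Ian's actual app):

library(shiny)

ui <- fluidPage(
  actionButton("run", "Run analysis"),
  verbatimTextOutput("result")
)

server <- function(input, output) {
  result <- eventReactive(input$run, {      # run only when the button is clicked
    withProgress(message = "Analyzing...", value = 0, {
      for (i in 1:5) {
        Sys.sleep(1)                        # stand-in for a slow computation step
        incProgress(1 / 5)                  # Shiny's built-in progress indicator
      }
    })
    "Analysis complete"
  })
  output$result <- renderPrint(result())
}

shinyApp(ui, server)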

There are a couple of false starts in there but by the time you reach the third act, the story makes sense.  H/T R-Bloggers
