
Category: R

Tol Color Schemes In R

Jason C. Fisher walks us through a color scheme generator based on Paul Tol’s research:

Choosing colors for a graphic is a bit like taking a trip down the rabbit hole, that is, it can take much longer than expected and be both fun and frustrating at the same time. Striking a balance between colors that look good to you and your audience is important. Keep in mind that color blindness affects many individuals throughout the world and it is incumbent on you to choose a color scheme that works in color-blind vision. Luckily there are a number of excellent R packages that address this very issue, such as the colorspace, RColorBrewer, and viridis packages. And because this is R, where diversity is king, why not offer one more function for creating color blind friendly palettes.

Let me introduce the GetTolColors function in the R-package inlmisc. This function generates a vector of colors from qualitative, diverging, and sequential color schemes by Paul Tol (2018). The original inspiration for developing this function came from Peter Carl’s blog post describing color schemes from an older issue of Paul Tol’s Technical Note (issue 2.2, released Dec. 2012). And the qualitative color schemes described in his blog post found their way into the ptol_pal function in the R-package ggthemes. My intent with this document is to exhibit the latest Tol color schemes (issue 3.0, released May 2018) and show that they are not only visually pleasing but also well thought out.

Read on for step-by-step instructions and to see some of the palettes.  The package authors have taken care in color design, so check it out.

Comments closed

Labeling Line Ends In ggplot2

Simon Jackson shows how you can use the secondary axis to label line endings in ggplot2:

Now we can use scale_y_*, with the argument sec.axis to create a second axis on the right, with numbers to be displayed at breaks, defined by our vector of line ends:

ggplot(d, aes(age, circumference, color = Tree)) +
  geom_line() +
  scale_y_continuous(sec.axis = sec_axis(~ ., breaks = d_ends))

This is good.  I’d really prefer to show the labels instead of the value; that way it’d be possible to eliminate the legend altogether.  H/T R-Bloggers.
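As a rough sketch of the label-based variant, the code below assumes that d is the built-in Orange dataset (the excerpt does not say where d comes from) and that d_ends holds each tree's final circumference. The last_rows helper is a hypothetical name, and the plotting step is skipped when ggplot2 is not installed.

```r
# Assumption: d is the built-in Orange dataset (columns Tree, age, circumference).
d <- as.data.frame(datasets::Orange)

# For each tree, keep the row with the largest age, i.e. the end of its line.
last_rows <- do.call(
  rbind,
  lapply(split(d, d$Tree), function(g) g[which.max(g$age), ])
)

# Named vector: values are the line-end positions, names are the tree labels.
d_ends <- setNames(last_rows$circumference, as.character(last_rows$Tree))

# Plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(age, circumference, color = Tree)) +
    geom_line() +
    # Put the tree names, not the values, at the line-end positions.
    scale_y_continuous(
      sec.axis = sec_axis(~ ., breaks = unname(d_ends), labels = names(d_ends))
    ) +
    # The labels identify each line, so the legend can go.
    theme(legend.position = "none")
}
```

Lines that end close together will still produce overlapping labels, so this works best when the end points are well separated.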


R From The Year 2000

Colin Gillespie takes us down memory lane with some old, old code:

Last week I spent some time reminiscing about my PhD and looking through some old R code. This trip down memory lane led to some of my old R scripts that amazingly still run. My R scripts were fairly simple and just created a few graphs. However now that I’ve been programming in R for a while, with hindsight (and also things have changed), my original R code could be improved.

I wrote this code around April 2000. To put this into perspective,

  • R mailing list was started in 1997
  • R version 1.0 was released on Feb 29, 2000
  • The initial release of Git was in 2005
  • Twitter started in 2006
  • StackOverflow was launched in 2008

Basically, sharing code and getting help was much more tricky than today – so cut me some slack!

It’s a good sign when an arbitrary task becomes easier to understand as a language evolves.  And I’m glad they dumped the underscore assignment operator.


Reticulate: Python-R Interop

Adnan Fiaz walks us through an example of using the reticulate library to call Python from R:

So what exactly does reticulate do? Its goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session which enables you to call Python functionality from within R. I’m not going to go into the nitty gritty of how the package works here; RStudio have done a great job in providing some excellent documentation and a webinar. Instead I’ll show a few examples of the main functionality.

Just like R, the House of Python was built upon packages. Except in Python you don’t load functionality from a package through a call to library but instead you import a module. reticulate mimics this behaviour and opens up all the goodness from the module that is imported.

This is a good intro to a package which is already useful but I think will be even better over time as R & Python interoperability becomes the norm.  H/T R-Bloggers


Working With Data Frames In R

Dave Mason has a couple of blog posts on data frames.  First, the basics:

Conceptually, a dataset is a grid or table of data elements. It consists of rows, which we specifically call “observations”, and of columns, which are called “variables”. (Observations may also be referred to as “instances”. Variables may also be referred to as “properties”.) The data frame in R is designed for data sets. As the R documentation tells us, data frames are “used as the fundamental data structure by most of R’s modeling software”.

The function we’ll be working with primarily in this post is the data.frame() function. I have read that in R programming, creating data frames with this function is rather uncommon. Most of the time, data frames are created by invoking other functions that read data from an external data source (like a file or a database table) with a data frame as the return type. But for simplicity, data.frame() will serve our purposes.

Then, subsetting data frames:

Adding columns to a data frame is easy, at least compared to adding rows. We’ll get to that. To add a column, first create a vector. The class doesn’t matter. But the number of elements does: it has to match the number of observations in the data frame. Now that we have our vector, here are some options to add it as a new column to a data frame: use the $ shortcut, use double brackets with the new column name, or bind the vector to the data frame with cbind().
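The three options can be sketched in base R as follows. The data frame and the score vector are made-up illustration data, not taken from the original posts.

```r
# Hypothetical data frame with three observations.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# The new column: one element per observation.
score <- c(90.5, 82, 77.25)

# Option 1: the $ shortcut.
df1 <- df
df1$score <- score

# Option 2: double brackets with the new column name.
df2 <- df
df2[["score"]] <- score

# Option 3: cbind(); the column takes its name from the variable.
df3 <- cbind(df, score)

print(df3)
```

All three produce the same columns. The first two modify a data frame in place, while cbind() returns a new one.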

The data frame (or tibble, if using the tidyverse version) is probably the single most important data type in R for getting work done.


Calculating Lifetime Value With R

Sergey Bryl shows how to calculate the lifetime value of a subscription service:

Predicting LTV is a common issue for a new, recently launched product/service/application when we don’t have a lot of historical data but want to calculate LTV as soon as possible. Even though we may have a lot of historical data on customer payments for a product that is active for years, we can’t really trust earlier stats since the churn curve and LTV can differ significantly between new customers and the current ones due to a variety of reasons.

Therefore, regardless of whether our product is new or “old”, we attract new subscribers and want to estimate what revenue they will generate during their lifetimes for business decision-making.

This topic is closely connected to the Cohort Analysis and if you are not familiar with the concept, I recommend that you read about it and look at other articles I wrote earlier on this blog.

Read the whole thing.


Interpreting The Area Under The Receiver Operating Characteristic Curve

Roos Colman explains what a Receiver Operating Characteristic (ROC) curve is and how we interpret the Area Under the Curve (AUC):

The AUC can be defined as “The probability that a randomly selected case will have a higher test result than a randomly selected control”. Let’s use this definition to calculate and visualize the estimated AUC.
In the figure below, the cases are presented on the left and the controls on the right.
Since we have only 12 patients, we can easily visualize all 32 possible combinations of one case and one control. (R code below)
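To make the pairwise definition concrete, here is a minimal base R sketch with invented test results for 4 cases and 8 controls (12 patients, 32 pairs). The values are illustrative only and are not Colman's data. Scoring a tie as half a win is a common convention that is assumed here.

```r
# Hypothetical test results: 4 cases and 8 controls, 12 patients in total.
cases    <- c(7.2, 5.1, 8.4, 6.3)
controls <- c(4.0, 5.5, 3.2, 6.3, 2.8, 4.9, 5.0, 3.7)

# Every combination of one case and one control: 4 * 8 = 32 pairs.
pairs <- expand.grid(case = cases, control = controls)

# Score each pair: 1 if the case is higher, 0.5 for a tie, 0 otherwise.
pair_score <- ifelse(pairs$case > pairs$control, 1,
                     ifelse(pairs$case == pairs$control, 0.5, 0))

# The estimated AUC is the average score over all pairs.
auc <- mean(pair_score)
print(auc)
```

With these numbers, 29 pairs favor the case and one pair is tied, so the estimate is 29.5 / 32.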

Expanding from this easy-to-follow example, Colman walks us through some of the statistical tests involved.  Check it out.


Building A Neural Network In R With Keras

Pablo Casas walks us through Keras on R:

One of the key points in Deep Learning is to understand the dimensions of the vectors, matrices and/or arrays that the model needs. I found that these are the types supported by Keras.

In Python’s words, it is the shape of the array.

To do a binary classification task, we are going to create a one-hot vector. It works the same way for more than 2 classes.

For instance:

  • The value 1 will be the vector [0,1]
  • The value 0 will be the vector [1,0]

Keras provides the to_categorical function to achieve this goal.
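To show what the encoding looks like without installing Keras, here is a small base R stand-in. The one_hot function is a hypothetical helper written for this illustration; it is not the Keras implementation, and it assumes class labels are integers starting at 0.

```r
# Hypothetical helper: one-hot encode integer class labels 0, 1, ..., k - 1.
# This is a base R stand-in, not the Keras function.
one_hot <- function(y, num_classes = max(y) + 1L) {
  # One row per observation, one column per class, filled with zeros.
  m <- matrix(0, nrow = length(y), ncol = num_classes)
  # Set the column matching each label to 1 (label 0 maps to column 1).
  m[cbind(seq_along(y), y + 1L)] <- 1
  m
}

y <- c(1L, 0L, 1L)
encoded <- one_hot(y)
print(encoded)
```

The first row is the vector for the value 1 and the second row is the vector for the value 0, matching the two bullets above. The same function handles more than two classes by adding columns.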

This example doesn’t include using CUDA, but the data sizes are small enough that it doesn’t matter much.  H/T R-Bloggers


ElasticMapReduce And RStudio

Tanzir Musabbir demonstrates how to set up Amazon ElasticMapReduce to include an RStudio edge node:

RStudio Server provides a browser-based interface for R and is a popular tool among data scientists. Data scientists use Apache Spark clusters running on Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on an Amazon EMR cluster. However, in certain scenarios you might want to install it on a standalone Amazon EC2 instance and connect to a remote Amazon EMR cluster. Benefits of running RStudio on EC2 include the following:

  • By running RStudio Server on an EC2 instance, you can keep your scientific models and model artifacts on the instance. You might have to relaunch your EMR cluster to meet your application requirements. By running RStudio Server separately, you have more flexibility and don’t have to depend entirely on an Amazon EMR cluster.
  • Installing RStudio on the master node of Amazon EMR requires sharing of resources with the applications running on the same node. By running RStudio on a standalone Amazon EC2 instance, you can use resources as you need without having to share the resources with other applications.
  • You might have multiple Amazon EMR clusters in your environment. With RStudio on an edge node, you have the flexibility to connect to any EMR cluster in your environment.

There is one major difference between running RStudio Server on an Amazon EMR cluster vs. running it on a standalone Amazon EC2 instance. In the latter case, the instance needs to be configured as an Amazon EMR client (or edge node). By doing so, you can submit Apache Spark jobs and other Hadoop-based jobs from an instance other than the EMR master node.

Click through for detailed, step-by-step instructions on how to do this.


Mutating Data Frames Without dplyr

John Mount points out that there is a built-in function to mutate data frames in R:

The notation we used above is the “explicit argument” variation we recommend for readability. What a lot of dplyr users do not seem to know is: base-R already has this functionality. The function is called transform().

To demonstrate this, let’s first detach dplyr to show that we are not using functions from dplyr.

detach("package:dplyr", unload = TRUE)

Now let’s write the equivalent pipeline using exclusively base-R.

Click through for the way to do this as a pipeline operation.
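For a quick sense of what transform() does on its own, here is a minimal sketch on a made-up data frame. The column names are illustrative and this is not Mount's pipeline example.

```r
# Hypothetical data frame for illustration.
d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# transform() adds or replaces columns using expressions evaluated
# against the existing columns; it needs no packages beyond base R.
d2 <- transform(d, z = x + y, ratio = y / x)

print(d2)
```

One difference from dplyr's mutate() is that columns created in a transform() call cannot be referenced by later expressions in the same call, so dependent columns need a second call.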
