
Category: R

Sorting With data.table Versus dplyr

John Mount shows us that data.table is way faster for sorting than dplyr's arrange function:

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort, which has the potential to perform in near-linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems (also, we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.

John has also provided us with a markdown document for comparison.
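
For a feel of the radix-versus-comparison difference without installing anything, here is a minimal sketch in base R. It relies only on sort.int() accepting method = "radix" and method = "shell", which it does in R 3.3.0 and later. The vector size and seed are arbitrary choices of mine, and this is not Mount's benchmark, which compares data.table against dplyr's arrange().

```r
# Sketch: compare base R's radix sort with its shell (comparison-based) sort.
set.seed(2018)  # arbitrary seed, for repeatability only
x <- sample.int(1e6, size = 2e5, replace = TRUE)

radix_time <- system.time(radix_sorted <- sort.int(x, method = "radix"))
shell_time <- system.time(shell_sorted <- sort.int(x, method = "shell"))

# Both methods must agree on the result; only the run time differs.
identical(radix_sorted, shell_sorted)

# Elapsed seconds for each method (values depend on your machine).
c(radix = radix_time[["elapsed"]], shell = shell_time[["elapsed"]])
```

Rerunning with larger vectors should show how the gap changes with row count, though a proper comparison would repeat each timing several times.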


Matrices In R

Dave Mason continues his perusal of R data types, this time looking at the matrix:

All of the examples so far have consisted of matrices with data elements of the same class. And for good reason: it’s a requirement for a matrix. R will coerce elements with mismatched classes to the same class. Here are two vectors, one of class integer and the other of class character. After combining them into a matrix via rbind(), we see the first row of data elements are of the character class (enclosed in double quotes):

> row1 <- c(1L, 2L, 3L, 4L)
> row2 <- c("a", "b", "c", "d")
>  new_matrix <- rbind(row1, row2)
> new_matrix
     [,1] [,2] [,3] [,4]
row1 "1"  "2"  "3"  "4" 
row2 "a"  "b"  "c"  "d"
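
The coercion above can be confirmed with typeof(). The sketch below repeats the two vectors from the excerpt and then shows one possible alternative when the types need to stay distinct. The data frame and its column names (id, label) are my own illustration, not something from Dave's post.

```r
# The same two vectors as in the excerpt.
row1 <- c(1L, 2L, 3L, 4L)
row2 <- c("a", "b", "c", "d")
new_matrix <- rbind(row1, row2)

# A matrix holds a single type, so the integers were coerced to character.
typeof(new_matrix)

# If each variable must keep its own type, a data frame stores one type per column.
# The column names here are hypothetical.
mixed <- data.frame(id = row1, label = row2, stringsAsFactors = FALSE)
sapply(mixed, class)
```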

Matrices drive a large number of statistical techniques, though I tend to deal with them less directly than I would have imagined.


Binning And Recoding In R

Sebastian Sauer shows a few methods of practical data reshaping in R:

Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels to one, for instance aggregating the values from “1.00 meter” to “1.60 meter” to “small_size”.

Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.

Click through for examples of techniques you can use.
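
As a rough illustration of the two ideas in base R (Sebastian's post covers its own set of methods, which may differ), recoding can be done with factor() and binning with cut(). The sample values, the cut points above 1.60, and the labels medium_size and large_size are assumptions of mine.

```r
# Recoding: map the codes 1 and 2 to readable labels.
sex_code <- c(1, 2, 2, 1)  # hypothetical sample data
sex <- factor(sex_code, levels = c(1, 2), labels = c("woman", "man"))

# Binning: collapse heights in meters into size classes.
# Only the 1.00 to 1.60 "small_size" bin comes from the post; the rest is assumed.
height_m <- c(1.55, 1.72, 1.60, 1.95)  # hypothetical sample data
size <- cut(
  height_m,
  breaks = c(1.00, 1.60, 1.80, Inf),
  labels = c("small_size", "medium_size", "large_size"),
  include.lowest = TRUE  # so that exactly 1.00 falls in the first bin
)

data.frame(sex, height_m, size)
```

By default cut() closes intervals on the right, so a height of exactly 1.60 lands in small_size.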


Working With Vectors In R

Dave Mason continues his quest to learn R, focusing on vectors.  First, he looks at vector-based mathematical operations:

Now we can determine the number of customers gained vs number of customers lost (plus/minus) for each month of the quarter by subtracting one vector from another. Each vector has the same number of elements (three), and the result is also a vector of three elements:

> net_customer_gain <- new_customers - customers_lost
> net_customer_gain
Jan Feb Mar 
-15  30   3 

The sum() function can be used to add up all the elements of a vector. Below, we get the total number of new customers and lost customers for the first quarter:

> sum(new_customers)
[1] 270
> sum(customers_lost)
[1] 252
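
The excerpt does not show how the two vectors were built, so here is a self-contained sketch. The monthly values are hypothetical, picked only so that they agree with the differences and totals printed above.

```r
# Hypothetical monthly figures, chosen to match the printed output above.
new_customers  <- c(Jan = 80, Feb = 100, Mar = 90)
customers_lost <- c(Jan = 95, Feb = 70,  Mar = 87)

# Subtraction is element-wise; the names carry over from the first vector.
net_customer_gain <- new_customers - customers_lost
net_customer_gain

# The quarterly net gain is the difference of the two totals.
sum(net_customer_gain)
sum(new_customers) - sum(customers_lost)
```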

Then he shows off subsetting in vectors:

To extract multiple elements from a vector, pass in an integer class vector to the square brackets. The values of the integer vector correspond to the elements to be extracted. Here we will extract the first, third, and fourth elements of the jersey_numbers vector:

> jersey_numbers[c(1,3,4)]
Pierce  Rondo  Allen 
    34      9     20  

The values of the integer vector can be in any order:

> jersey_numbers[c(4,1,3)]
 Allen Pierce  Rondo 
    20     34      9
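
A few other ways to subset the same kind of vector, as a sketch. The excerpt only reveals elements one, three, and four of jersey_numbers, so the second element (Garnett = 5) is my assumption. Note also that c(1, 3, 4) is technically a double vector; R accepts whole-number doubles as positions just as it does integers.

```r
# Elements 1, 3, and 4 come from the excerpt; element 2 is assumed.
jersey_numbers <- c(Pierce = 34, Garnett = 5, Rondo = 9, Allen = 20)

# By name: a character vector selects elements in the order given.
by_name <- jersey_numbers[c("Allen", "Pierce")]

# By exclusion: a negative position drops that element.
without_second <- jersey_numbers[-2]

# By condition: a logical vector keeps the elements where it is TRUE.
above_ten <- jersey_numbers[jersey_numbers > 10]

by_name
without_second
above_ten
```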

Vectors are a critical part of understanding R.


The Problem With Meta-Packages

John Mount has a critique of meta-packages:

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

I’m not really sold on Jones’s argument, but I do think Mount has a good point.


Calculating Cohort Lifetime Value With Excel And R

Eleni Markou shows how to calculate the lifetime value of a group of customers using two techniques:

A lot of ink has been spilled in developing various descriptions of the LTV, the majority of which ends up with mathematical formulas that are based on margin (m), retention rate (r) and discount rate (d) like the following (here):

However, this model appears to be not that realistic as it is based on a few quite restrictive assumptions:

  • Retention is assumed to be constant during the lifetime of a customer, i.e. the probability r of remaining retained remains the same across all months.
  • An infinite time horizon is assumed when calculating the present value of future cash flows.
  • The unit economics are supposed to be constant throughout lifetime which leads to a constant contribution margin.

Yet when dealing with an actual company, it easily becomes evident that none of the aforementioned conditions actually hold. Especially in early-stage businesses the size of the time periods across which you would like to calculate the LTV is month – or week – sized while at the same time the retention rate across them can vary significantly as the company’s products evolve quickly.
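
To make the contrast concrete, here is a minimal sketch in R. The formula the article links to is not reproduced in this excerpt, so the closed form below is one common textbook version, m * r / (1 + d - r), and should be read as an assumption rather than as Eleni's exact model. The function names and all input numbers are hypothetical. The finite version lets retention differ by period and stops at a fixed horizon.

```r
# Finite-horizon LTV: in period t the customer is still active with probability
# prod(retention[1:t]), pays `margin`, and the cash flow is discounted by (1 + d)^t.
ltv_finite <- function(margin, retention, discount) {
  periods  <- seq_along(retention)
  survival <- cumprod(retention)
  sum(margin * survival / (1 + discount)^periods)
}

# Closed form under constant retention, constant margin, and an infinite horizon.
ltv_closed_form <- function(margin, retention, discount) {
  margin * retention / (1 + discount - retention)
}

# With constant retention and a very long horizon, the two should agree.
constant_case <- ltv_finite(10, rep(0.8, 2000), 0.01)
closed_case   <- ltv_closed_form(10, 0.8, 0.01)
all.equal(constant_case, closed_case)

# An early-stage pattern: retention improves over six periods (made-up numbers).
early_stage <- ltv_finite(
  margin    = 10,
  retention = c(0.60, 0.75, 0.85, 0.90, 0.90, 0.90),
  discount  = 0.01
)
early_stage
```

Passing a vector for margin, one value per period, would relax the constant unit economics assumption as well.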

There’s a lot packed into that article, so give it a read.


Highlighting Data With gghighlight

Laura Ellis shows off the gghighlight package, which allows you to highlight selectively certain sets of data in ggplot:

While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame.  Further, you have to tinker more with the labelling to really call out the highlighted data points.

Thanks to Hiroaki Yutani, we now have the gghighlight package which does most of the work for us with a small function call!! Please note that a lot of this code was created by looking at examples in the package's introduction document.

The new school way is even simpler:

  1. Using ggplot2, create a plot with your full data set.

  2. Add the gghighlight() function to your plot with the conditions set to identify your subset.

  3. Celebrate! This was one less step AND we got labels!
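
The steps above might look something like the following sketch. The sales data frame, its column names, and the threshold of 40 are invented for illustration; the plotting part only runs if ggplot2 and gghighlight are installed.

```r
# Hypothetical data: three regions with different monthly growth rates.
sales <- data.frame(
  month  = rep(1:12, times = 3),
  region = rep(c("North", "South", "West"), each = 12),
  units  = c(1 * (1:12), 4 * (1:12), 2 * (1:12))
)

# Which regions would the condition below pick out? Only South peaks above 40.
tapply(sales$units, sales$region, max)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gghighlight", quietly = TRUE)) {
  library(ggplot2)
  library(gghighlight)

  # Step 1: plot the full data set. Step 2: add gghighlight() with a condition.
  p <- ggplot(sales, aes(x = month, y = units, colour = region)) +
    geom_line() +
    gghighlight(max(units) > 40)
  # In an interactive session, print(p) draws the plot.
}
```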

That’s a very cool package.  H/T R-Bloggers


Stoppable, Async Shiny Interfaces

Ian at Fells Stats wants to make a long-running Shiny app a bit more user-friendly:

Shiny operates in a reactive programming framework. Fundamentally this means that any time any UI element that affects the result changes, so does the result. This happens automatically, with your analysis code running every time a widget is changed. In a lot of cases, this is exactly what you want and it makes Shiny programs concise and easy to make; however in the case of long running processes, this can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the action button is clicked. Another important component is to provide your user with feedback as to how long the analysis is going to take. Shiny has nice built in progress indicators that allow you to do this.
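
Here is a minimal sketch of that "easiest solution": an action button plus a progress indicator. It does not make the computation stoppable or asynchronous, which is where Ian's post ends up. The slow_analysis() function, the input IDs, and the step count are all hypothetical stand-ins, and the Shiny part only runs if the shiny package is installed.

```r
# A stand-in for a long-running analysis; on_step is called after each step.
slow_analysis <- function(n_steps, on_step = function(i) NULL) {
  total <- 0
  for (i in seq_len(n_steps)) {
    Sys.sleep(0.01)  # pretend to do real work
    total <- total + i
    on_step(i)
  }
  total
}

if (requireNamespace("shiny", quietly = TRUE)) {
  library(shiny)

  ui <- fluidPage(
    numericInput("n_steps", "Steps", value = 20, min = 1),
    actionButton("run", "Run analysis"),
    verbatimTextOutput("result")
  )

  server <- function(input, output, session) {
    # eventReactive() recomputes only when the button is clicked,
    # not every time the numeric input changes.
    result <- eventReactive(input$run, {
      n <- input$n_steps
      withProgress(message = "Running analysis", value = 0, {
        slow_analysis(n, on_step = function(i) incProgress(1 / n))
      })
    })
    output$result <- renderPrint(result())
  }

  app <- shinyApp(ui, server)  # in an interactive session: runApp(app)
}
```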

There are a couple of false starts in there, but by the time you reach the third act, the story makes sense.  H/T R-Bloggers


Classes And Vectors In R

Dave Mason continues his journey toward learning R.  He looks next at the class() function:

Note the value assigned to horse_power is a whole number (integer) and the value assigned to miles_per_gallon is a rational number. But R tells us they are both of the “numeric” class. R does have an integer class. A variable’s class will be integer if the value is followed by a capital “L”. Let’s reassign a value to horse_power to demonstrate:

> horse_power <- 240L
> class(horse_power)
[1] "integer"

Another way to determine the class of a variable is to use one of the is.*() functions. For example, is.integer() and is.numeric() tell us the miles_per_gallon is not an integer, and is a numeric:

> is.integer(miles_per_gallon)
[1] FALSE
> is.numeric(miles_per_gallon)
[1] TRUE

There’s also the typeof() function and the mode() function, and all three can differ under certain circumstances.
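
A quick sketch of how the three functions can disagree. The describe() helper is my own, and the miles_per_gallon value is hypothetical since the excerpt does not show it.

```r
horse_power      <- 240L
miles_per_gallon <- 24.5  # hypothetical value; not shown in the excerpt

# Hypothetical helper: report class(), typeof(), and mode() side by side.
describe <- function(x) {
  c(class = class(x), typeof = typeof(x), mode = mode(x))
}

describe(horse_power)       # integer, integer, numeric
describe(miles_per_gallon)  # numeric, double, numeric
describe(sum)               # function, builtin, function
```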

Next up, Dave hits vectors, the simplest of the interesting data types in R:

It’s important to know that the elements of a vector must be of the same class (data type). If the values passed to the c() function are of different classes, some of them will be coerced to a different class to ensure all elements of the vector share one class. Below, the parameter classes passed to the c() function include character, numeric, and integer. The corresponding numeric and integer parameter values are coerced to character within the vector:

> some_data <- c("a", "b", 7.5, 25L)
> some_data
[1] "a"   "b"   "7.5" "25"

Read on for more about vectors.
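
Coercion follows a fixed order: logical gives way to integer, integer to double, and everything to character. The sketch below walks through that order, then shows two follow-ups that are my own additions rather than part of Dave's post: converting back with as.numeric() and using a list when mixed classes must be kept.

```r
# Each pair is coerced to the more general of the two types.
typeof(c(TRUE, 2L))   # "integer"
typeof(c(2L, 7.5))    # "double"
typeof(c(7.5, "a"))   # "character"

# Converting back: strings that are not numbers become NA (with a warning).
some_data <- c("a", "b", 7.5, 25L)
numbers_back <- suppressWarnings(as.numeric(some_data))
numbers_back          # NA NA 7.5 25.0

# A list keeps each element's own class.
mixed <- list("a", "b", 7.5, 25L)
sapply(mixed, class)  # "character" "character" "numeric" "integer"
```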


debugr: Debugging In R

Joachim Zuckarelli announces a new R package, debugr:

debugr is a new package designed to support debugging in R. It mainly provides the dwatch() function which prints a debug output to the console or to a file. A debug output can consist of a static text message, the values of one or more objects (potentially transformed by applying some functions) or the value of one or multiple (more complex) R expressions.

Whether or not a debug message is displayed can be made dependent on the evaluation of a criterion phrased as an R expression. Generally, debug messages are only shown if the debug mode is activated. The debug mode is activated and deactivated with debugr_switchOn() and debugr_switchOff(), respectively, which change the logical debugr.active value in the global options. Since debug messages are only displayed in debug mode, the dwatch() function calls can even remain in the original code as they remain silent and won’t have any effect until the debug mode is switched on again.
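
To illustrate the idea (not the debugr API itself), here is a base R sketch of a debug message that depends on both a global option and a criterion. The function debug_watch() and the option name sketch.debug.active are hypothetical; debugr's own dwatch() has its own arguments, which are documented with the package.

```r
# Hypothetical stand-in for the concept: prints only when the (made-up) option
# sketch.debug.active is TRUE and the criterion holds.
debug_watch <- function(msg, ..., crit = TRUE) {
  if (!isTRUE(getOption("sketch.debug.active", FALSE))) return(invisible(FALSE))
  if (!isTRUE(crit)) return(invisible(FALSE))
  message("[debug] ", msg)
  values <- list(...)  # pass the objects to show as named arguments
  for (name in names(values)) {
    message("  ", name, " = ", paste(format(values[[name]]), collapse = " "))
  }
  invisible(TRUE)
}

running_total <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    # Stays silent unless debug mode is on and the total has passed 5.
    debug_watch("total passed 5", i = i, total = total, crit = total > 5)
  }
  total
}

options(sketch.debug.active = TRUE)   # switch debug mode on
running_total(c(2, 3, 4))
options(sketch.debug.active = FALSE)  # switch it off; the calls can stay in place
running_total(c(2, 3, 4))
```

Because R evaluates arguments lazily, neither the criterion nor the watched values are computed while debug mode is off, so leaving the calls in place costs very little.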

Click through for links to additional resources.  It looks like an interesting way of tracing problems in more error-prone segments of code.  H/T R-Bloggers
