Press "Enter" to skip to content


The Problem With Meta-Packages

John Mount has a critique of meta-packages:

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

I’m not really sold on Jones’s argument, but I do think Mount has a good point.


Calculating Cohort Lifetime Value With Excel And R

Eleni Markou shows how to calculate the lifetime value of a group of customers using two techniques:

A lot of ink has been spilled in developing various descriptions of the LTV, the majority of which end up with mathematical formulas based on margin (m), retention rate (r), and discount rate (d), like the following:
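
That standard formula (presumably the one the article shows) is:

$$\mathrm{LTV} = m \cdot \frac{r}{1 + d - r}$$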

However, this model is not all that realistic, as it is based on a few quite restrictive assumptions:

  • Retention is assumed to be constant during the lifetime of a customer, i.e., the probability r of remaining a customer is the same across all months.
  • An infinite time horizon is assumed when calculating the present value of future cash flows.
  • The unit economics are supposed to be constant throughout lifetime which leads to a constant contribution margin.

Yet when dealing with an actual company, it quickly becomes evident that none of the aforementioned conditions actually hold. Especially in early-stage businesses, the time periods across which you would like to calculate the LTV are month- or week-sized, while the retention rate across them can vary significantly as the company's products evolve quickly.
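
Here's a minimal sketch of the cohort-based alternative in R (all numbers are invented for illustration): discount each period's margin by the observed, period-specific retention instead of assuming a constant r.

# Cohort LTV with observed, time-varying retention (illustrative numbers)
margin    <- 50                                # contribution margin per customer per month
retention <- c(1.00, 0.60, 0.45, 0.38, 0.33)   # share of the cohort still active each month
d         <- 0.01                              # monthly discount rate

months <- seq_along(retention) - 1
ltv    <- sum(margin * retention / (1 + d)^months)
ltv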

There’s a lot packed into that article, so give it a read.


Highlighting Data With gghighlight

Laura Ellis shows off the gghighlight package, which allows you to selectively highlight certain sets of data in ggplot2:

While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame.  Further, you have to tinker more with the labelling to really call out the highlighted data points.

Thanks to Hiroaki Yutani, we now have the gghighlight package, which does most of the work for us with a small function call! Please note that a lot of this code was created by looking at examples in the package's introduction document.

The new school way is even simpler:

  1. Using ggplot2, create a plot with your full data set.

  2. Add the gghighlight() function to your plot with the conditions set to identify your subset.

  3. Celebrate! This was one less step AND we got labels!
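
Here's a minimal sketch of those three steps using the built-in mtcars data (the data set and threshold are my own stand-ins, not from the article):

library(ggplot2)
library(gghighlight)

# Step 1: plot the full data set
d <- data.frame(model = rownames(mtcars), mtcars)
p <- ggplot(d, aes(wt, mpg)) +
  geom_point()

# Step 2: add gghighlight() with a condition identifying the subset
# Step 3: matching points are highlighted and labeled automatically
p + gghighlight(mpg > 30, label_key = model)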

That’s a very cool package.  H/T R-Bloggers


Stoppable, Async Shiny Interfaces

Ian at Fells Stats wants to make a long-running Shiny app a bit more user-friendly:

Shiny operates in a reactive programming framework. Fundamentally, this means that any time any UI element that affects the result changes, so does the result. This happens automatically, with your analysis code running every time a widget is changed. In a lot of cases, this is exactly what you want, and it makes Shiny programs concise and easy to make; however, in the case of long-running processes, this can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the button is clicked. Another important component is to provide your user with feedback on how long the analysis is going to take. Shiny has nice built-in progress indicators that allow you to do this.
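
Here's a minimal sketch of that pattern, with a sleep loop standing in for the real analysis:

library(shiny)

ui <- fluidPage(
  actionButton("run", "Run analysis"),
  plotOutput("result")
)

server <- function(input, output) {
  # Only recompute when the button is clicked, not on every widget change
  analysis <- eventReactive(input$run, {
    withProgress(message = "Running analysis...", value = 0, {
      for (i in 1:10) {
        Sys.sleep(0.3)        # stand-in for one chunk of slow work
        incProgress(1 / 10)   # update the built-in progress indicator
      }
      rnorm(500)
    })
  })
  output$result <- renderPlot(hist(analysis()))
}

shinyApp(ui, server)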

There are a couple of false starts in there, but by the time you reach the third act, the story makes sense.  H/T R-Bloggers


Classes And Vectors In R

Dave Mason continues his journey toward learning R.  He looks next at the class() function:

Note the value assigned to horse_power is a whole number (integer) and the value assigned to miles_per_gallon is a rational number. But R tells us they are both of the “numeric” class. R does have an integer class. A variable’s class will be an integer if the value is followed by a capital “L”. Let’s reassign a value to horse_power to demonstrate:

> horse_power <- 240L
> class(horse_power)
[1] "integer"

Another way to determine the class of a variable is to use one of the is.*() functions. For example, is.integer() and is.numeric() tell us that miles_per_gallon is not an integer but is numeric:

> is.integer(miles_per_gallon)
[1] FALSE
> is.numeric(miles_per_gallon)
[1] TRUE

There’s also the typeof() function and the mode() function, and all three can differ under certain circumstances.
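
A quick illustration of how the three can disagree (my example, not Dave's):

> x <- 2.5
> class(x)
[1] "numeric"
> typeof(x)
[1] "double"
> mode(x)
[1] "numeric"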

Next up, Dave hits vectors, the simplest of the interesting data types in R:

It’s important to know that the elements of a vector must all be of the same class (data type). If the values passed to the c() function are of different classes, some of them will be coerced to a different class so that all elements of the vector share one class. Below, the values passed to the c() function include character, numeric, and integer classes. The numeric and integer values are coerced to character within the vector:

> some_data <- c("a", "b", 7.5, 25L)
> some_data
[1] "a"   "b"   "7.5" "25" 
>

Read on for more about vectors.


debugr: Debugging In R

Joachim Zuckarelli announces a new R package, debugr:

debugr is a new package designed to support debugging in R. It mainly provides the dwatch() function which prints a debug output to the console or to a file. A debug output can consist of a static text message, the values of one or more objects (potentially transformed by applying some functions) or the value of one or multiple (more complex) R expressions.

Whether or not a debug message is displayed can be made dependent on the evaluation of a criterion phrased as an R expression. Generally, debug messages are only shown if the debug mode is activated. The debug mode is activated and deactivated with debugr_switchOn() and debugr_switchOff(), respectively, which change the logical debugr.active value in the global options. Since debug messages are only displayed in debug mode, the dwatch() function calls can even remain in the original code as they remain silent and won’t have any effect until the debug mode is switched on again.
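
Here's a rough sketch of how that might look in practice; the dwatch() arguments crit and objs follow the package's documentation, but treat the details as an approximation rather than a verbatim example:

library(debugr)

slow_step <- function(x) {
  result <- x^2
  # Print x and result to the console, but only in debug mode
  # and only when the criterion evaluates to TRUE
  dwatch(crit = "x > 1000", objs = c("x", "result"))
  invisible(result)
}

debugr_switchOn()    # debug messages now display
slow_step(1001)
debugr_switchOff()   # the dwatch() call stays in place but goes silent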

Click through for links to additional resources.  It looks like an interesting way of tracing problems in more error-prone segments of code.  H/T R-Bloggers


What Is R?

Dave Mason has started a new blog and hits the heavy topic first:

For anyone who has no idea what R is, comparisons to scripting languages like PowerShell, JavaScript, VBScript, or even DOS batch/cmd files might be helpful. I feel there are enough commonalities, at least conceptually at a high level, for the comparison to be appropriate. We’ve already seen some differences, though. The <- assignment operator sure is weird. I recall Oracle’s PL/SQL used := as an assignment operator. Almost all other languages I remember coding with use the near-universal = (equals sign). Using <- will take some time getting used to.

Those R variables used in this post are declared without a data type. But they do have underlying types, which I’ll cover in another post. If I remember correctly, JavaScript doesn’t have types; everything is an object (please leave a comment if this is wrong and I’ll correct the post later). VBScript used “var”s for everything, although you could coerce data types with functions like CInt, CBool, etc.
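
For reference, a few quick lines showing the assignment forms and implicit typing the post describes:

x <- 5      # canonical R assignment
y = 5       # '=' also works at the top level
10 -> z     # rightward assignment, rarely seen in practice

class(x)    # "numeric" -- no declared type, but there is an underlying class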

The way I like to describe R is as two things: first, it is a domain-specific language dedicated to statistical analysis; and second, it is a functional programming language (though not a pure functional language).


Using rquery On Databricks

Nina Zumel and John Mount talk about rquery, a relational data transformation engine for R which runs on Spark:

rquery is based on an appreciation of Codd’s relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Earlier, hierarchical databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).

Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:

  • select (row selection)
  • project (column selection/aggregation)
  • Cartesian product (table joins, row binding, and set difference)
  • extend (derived columns; the extend keyword comes from Tutorial-D)

One of the earliest and still most common implementations of Codd’s algebra is SQL. Formally, Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.

rquery is another realization of the Codd algebra: it implements the above operators and some higher-order operators, and it emphasizes a right-to-left pipe notation. This gives the Spark user an additional way to work effectively.
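
Here's a small sketch of those operators in rquery's notation, run in-memory via rqdatatable rather than on Spark (the data and column names are invented):

library(wrapr)        # %.>% dot pipe and := used below
library(rquery)
library(rqdatatable)  # in-memory executor for rquery pipelines

d <- data.frame(x = 1:5, y = c(2, 2, 3, 3, 3))

ops <- local_td(d) %.>%
  extend(., z := x + y) %.>%         # extend: derived column
  select_rows(., z > 4) %.>%         # select: row selection
  select_columns(., c("x", "z"))     # project: column selection

d %.>% ops    # execute the pipeline against the local data frame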

They include a fairly lengthy example and give a great introduction to the tool.  It’s now officially on my list of stuff to try out.


Explaining Text Classification Models With LIME

Shirin Glander shows us how to use LIME to explain which words help us classify whether a user liked a particular item:

Okay, not a perfect score but good enough for me – right now, I’m more interested in the explanations of the model’s predictions. For this, we need to run the lime() function and give it

  • the text input that was used to construct the model
  • the trained model
  • the preprocessing function

explainer <- lime(clothing_reviews_train$text,
                  xgb_model,
                  preprocess = get_matrix)

With this, we could right away call the interactive explainer Shiny app, where we can type any text we want into the field on the left and see the explanation on the right: words underlined in green support the classification, while red words contradict it.
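
If I recall the lime package's API correctly, that app launches from the explainer built above:

# Launch the interactive text-explanation Shiny app for the explainer
interactive_text_explanations(explainer)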

I hadn’t used LIME for this before, and it looks very interesting.  H/T R-Bloggers


Visualizing Linear Regression Results

Bernardo Lares gives us a few ways of visually interpreting a linear regression result in R:

The most obvious plot to study for a linear regression model, you guessed it, is the regression itself. If we plot the predicted values against the real values, we can see how close they are to our reference line of 45° (intercept = 0, slope = 1). If we had a very sparse plot with no clear tendency along that line, then we have a bad regression. On the other hand, if all our points sit exactly on the line, I'd bet you gave the model the results you wished for!
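
A generic sketch of that predicted-versus-actual plot, using mtcars rather than the author's data:

library(ggplot2)

fit <- lm(mpg ~ wt + hp, data = mtcars)
d <- data.frame(actual = mtcars$mpg, predicted = fitted(fit))

ggplot(d, aes(x = predicted, y = actual)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +  # 45-degree reference
  labs(title = sprintf("Adjusted R-squared: %.3f", summary(fit)$adj.r.squared))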

Then, the adjusted R-squared shown on the plot gives us an easy parameter for comparing models and seeing how well each one fits our reference line. The nearer this value gets to 1, the better. Without getting too technical: if you add more and more useless variables to a model, this value will decrease, but if you add useful variables, the adjusted R-squared will improve.

We also get the RMSE and MAE (Root Mean Squared Error and Mean Absolute Error) for our regression’s results. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. RMSE, on the other hand, is a quadratic scoring rule that also measures the average magnitude of the error: it’s the square root of the average of squared differences between prediction and actual observation. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better.
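
In R terms, for any pair of actual and predicted vectors, the two metrics reduce to:

actual    <- mtcars$mpg
predicted <- fitted(lm(mpg ~ wt + hp, data = mtcars))

mae  <- mean(abs(actual - predicted))       # average error magnitude, direction ignored
rmse <- sqrt(mean((actual - predicted)^2))  # square root of the mean squared error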

I like this approach to explaining models.
