Performing Row-Wise Operations with pmap

Kevin Feasel



Sebastian Sauer shows how you can use pmap in the purrr library to perform row-wise aggregations:

Rowwwise operations are a quite frequent operations in data analysis. The R language environment is particularly strong in column wise operations. This is due to technical reasons, as data frames are internally built as column-by-column structures, hence column wise operations are simple, rowwise more difficult.

This post looks at some rather general way to comput rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryant, and by Emil Hvitfeldt to name a few.

The ideal solution is to have your data be properly columnar, but if you’re in a pinch, it’s good to know that you can do this.

Random Forest on Small Numbers of Observations

Neil Saunders takes us through an interesting problem:

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?

The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.

Neil decided to explore the problem further regardless and came to some interesting conclusions.

Using Cohen’s D for Experiments

Nina Zumel takes us through Cohen’s D, a useful tool for determining effect sizes in experiments:

Cohen’s d is a measure of effect size for the difference of two means that takes the variance of the population into account. It’s defined as
d = | μ1 – μ2 | / σpooled
where σpooled is the pooled standard deviation over both cohorts.

Read the whole thing.

Comparing Iterator Performance in R

Kevin Feasel



Ulrik Stervbo has a performance comparison for for, apply, and map functions in R:

It is usually said, that for– and while-loops should be avoided in R. I was curious about just how the different alternatives compare in terms of speed.

The first loop is perhaps the worst I can think of – the return vector is initialized without type and length so that the memory is constantly being allocated.

The performance of map isn’t great, though the benefits to me are less about performance and more about readability. H/T R-bloggers

Visualizing with Heatmaps in R

Anisa Dhana shows how you can create a quick heatmap plot in R:

To give your own colors use the scale_fill_gradientn function.
ggplot(dat, aes(Age, Race)) +
geom_raster(aes(fill = BMI)) +
scale_fill_gradientn(colours=c("white", "red"))

This is a quick example using ggplot2 but there are other heatmap libraries available too.

Predicting Intermittent Demand

Bruno Rodrigues shows one technique for forecasting intermittent data:

Now, it is clear that this will be tricky to forecast. There is no discernible pattern, no trend, no seasonality… nothing that would make it “easy” for a model to learn how to forecast such data.

This is typical intermittent demand data. Specific methods have been developed to forecast such data, the most well-known being Croston, as detailed in this paper. A function to estimate such models is available in the {tsintermittent} package, written by Nikolaos Kourentzes who also wrote another package, {nnfor}, which uses Neural Networks to forecast time series data. I am going to use both to try to forecast the intermittent demand for the {RDieHarder} package for the year 2019.

Read the whole thing. H/T R-Bloggers

Learning R Versus Python

Kevin Feasel


Python, R

Andy Kirk shares the results of a rather informal Twitter poll:

Yesterday I ran a simple Twitter poll about the relative ease of learning R vs. Python. Although a correct answer to this query will ALWAYS have to be based on nuances like pre-existing skills and the scope of need, this originates from people telling me they encounter job or career profiles that list a need for R and/or Python. If they don’t have either, if they prioritised the pursuit of just one, which would be possible to develop a degree of competency more easily, more quickly and more efficiently?

Andy has also created a Twitter moment from the responses.

My thought, based only on the question itself, is that R would be better than Python because the hypothetical person has no additional programming skills. For someone with additional programming skills, the breakdown for me starts with, if your background is statistics, database development, or functional programming, you probably want R; if your background is object-oriented development or imperative programming, you probably want Python. And then it gets nuanced.

The Power of Hexagonal Binning

Capri Granville explains hexagonal binning to us and gives a few examples:

The reason for using hexagons is that it is still pretty simple, and when you rotate the chart by 60 degrees (or a multiple of 60 degrees) you still get the same visualization.  For squares, rotations of 60 degrees don’t work, only multiples of 90 degrees work. Is it possible to find a tessellation such that smaller rotations, say 45 or 30 degrees, leave the chart unchanged? The answer is no. Octogonal tessellations don’t really exist, so the hexagon is an optimum. 

Every time I see one of these, I think of old-timey strategy war games.

Matrix Operations in R

Kristian Larsen shows off some linear algebra concepts in R:

In this article, you learn how to do linear algebra in R. In particular, I will discuss how to create a matrix in R, Element-wise operations in R, Basic Matrix Operations in R, How to Combine Matrices in R, Creating Means and Sums in R and Advanced Matrix Operations in R.

The post is so chock-full of examples, the only block of multi-line text is the description.

An RStudio Configuration

Kevin Feasel



William Doane has published a sample RStudio configuration:

Whenever I need to install RStudio on a new machine, I have to think a bit about the configuration options I’ve tweaked. Invariably, I miss a checkbox that leaves me with slightly different RStudio behavior on each system. This post includes screenshots of my RStudio configuration and custom keyboard shortcuts for RStudio 1.3, MacOS, so that I have a reference.

I like these kinds of posts because they can help you find interesting settings you might not otherwise know about. Also, I second the FiraCode recommendation for R as well as F#. The only reason I don’t use it more is because I don’t want to confuse people during presentations. H/T R-Bloggers


August 2019
« Jul