Press "Enter" to skip to content

Performance Tips when Working with Large Datasets in R

Mira Celine Klein continues a series on performance tuning R code:

Whether your dataset is “large” not only depends on the number of rows, but also on the method you are going to use. It’s easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.

Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some more ways:

Click through for four options.