Partitioning Data For Performance Improvement In R

Kevin Feasel



John Mount shares a few examples of partitioning and parallelizing data operations in R:

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.
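The partition-then-distribute idea can be sketched in base R with the parallel package. This is not the code from John Mount's note and it does not use the wrapr helpers; the data frame `d`, the column `key`, and the function `f` are made up for illustration, and the sketch assumes that rows sharing a `key` value must stay together while different keys are unrelated.

```r
library(parallel)

# Hypothetical data: rows with the same `key` belong to the same group.
d <- data.frame(
  key = rep(c("a", "b", "c", "d"), each = 3),
  x = 1:12
)

# Hypothetical per-partition work: each row's share of its group's total.
f <- function(part) {
  part$share <- part$x / sum(part$x)
  part
}

# Partition the rows so that unrelated groups land in separate data.frames.
parts <- split(d, d$key)

# Start two worker processes, apply f to each partition, then stop the workers.
cl <- makeCluster(2)
res_list <- parLapply(cl, parts, f)
stopCluster(cl)

# Recombine the per-partition results into a single data.frame.
res <- do.call(rbind, res_list)
rownames(res) <- NULL
```

This sketch makes one partition per key for simplicity. With many small groups it would likely pay to bundle several keys into each partition so that the cost of shipping data to the workers is spread over more rows.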

There were some interesting results. I expected data.table to be fast, but did not expect dplyr to parallelize so well.

