Using rquery To Speed Up Data Manipulations

Kevin Feasel



John Mount shows off some rquery benchmarks versus dplyr and data.table:

Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr‘s new “wrapr_applicable” feature). This is where rquery works on in-memory data.frame data by sending it to a database, processing on the database, and then pulling the data back. We concede this is a strange way to process data, and not rquery’s primary purpose (the primary purpose being generation of safe high performance SQL for big data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.

We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand implementing logistic regression scoring. This is an example query we have been using for some time.

There are some nice early results, so it’ll be interesting to watch as this develops.

Related Posts

Faster User-Defined Functions In SparkR

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime: SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame dapply() dapplyCollect() gapply() gapplyCollect() dapply() allows you to run an R function on each partition of the SparkDataFrame and returns […]

Read More

Subsetting Matrices In R

Dave Mason continues his look at matrices in R: We can extract an entire row from a matrix. To do this, specify the desired row only within the square brackets [ ]. The placeholder where you would otherwise specify the column is left empty. > #Points scored by Kendrick Perkins. > points_scored_by_quarter[1,] 1st 2nd 3rd 4th […]

Read More


January 2018
« Dec Feb »