Faster User-Defined Functions In SparkR

Kevin Feasel

2018-08-20

Hadoop, R, Spark

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime:

SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame

  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

dapply() allows you to run an R function on each partition of the SparkDataFrame and returns the result as a new SparkDataFrame, on which you may apply other transformations or actions. gapply() allows you to apply a function to each grouped partition consisting of a key and the corresponding rows in a SparkDataFrame. dapplyCollect() and gapplyCollect()are shortcuts if you want to call collect() on the result.

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total, all of which are row-wise.

By vectorizing data serialization and deserialization in Databricks Runtime 4.3, we encode and decode all the values of a column at once. This eliminates the primary bottleneck which row-wise serialization, and significantly improves SparkR’s UDF performance. Also, the benefit from the vectorization is more drastic for larger datasets.

It looks like they get some pretty serious gains from this change.

Related Posts

What’s New In Ambari 2.7

Paul Codding and Kat Petre share some of the new features in Ambari 2.7: With this release, we wanted to make Ambari more enjoyable to use every day, simplify finding and using our API, and unblock teams managing very large clusters.  Here is a preview of a few features we’re excited to share with you:Revamped […]

Read More

Deploying An R Service To Azure Kubernetes Service

Hong Ooi shows us how we can use Azure Container Registry and Azure Kubernetes Service to deploy an R model via Plumber: If you run this code, you should see a lot of output indicating that R is downloading, compiling and installing randomForest, and finally that the image is being pushed to Azure. (You will […]

Read More

Categories

August 2018
MTWTFSS
« Jul Sep »
 12345
6789101112
13141516171819
20212223242526
2728293031