Faster User-Defined Functions In SparkR

Kevin Feasel

2018-08-20

Hadoop, R, Spark

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime:

SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame

  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

dapply() allows you to run an R function on each partition of the SparkDataFrame and returns the result as a new SparkDataFrame, on which you may apply other transformations or actions. gapply() allows you to apply a function to each grouped partition consisting of a key and the corresponding rows in a SparkDataFrame. dapplyCollect() and gapplyCollect()are shortcuts if you want to call collect() on the result.

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total, all of which are row-wise.

By vectorizing data serialization and deserialization in Databricks Runtime 4.3, we encode and decode all the values of a column at once. This eliminates the primary bottleneck which row-wise serialization, and significantly improves SparkR’s UDF performance. Also, the benefit from the vectorization is more drastic for larger datasets.

It looks like they get some pretty serious gains from this change.

Related Posts

MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class: Let me tell you about one of my least favorite things I like to see in PolyBase: Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster This error is not limited to PolyBase but is instead […]

Read More

Interactive ggplot Plots with plotly

Laura Ellis takes us through ggplotly: As someone very interested in storytelling, ggplot2 is easily my data visualization tool of choice. It is like the Swiss army knife for data visualization. One of my favorite features is the ability to pack a graph chock-full of dimensions. This ability is incredibly handy during the data exploration […]

Read More

Categories

August 2018
MTWTFSS
« Jul Sep »
 12345
6789101112
13141516171819
20212223242526
2728293031