Neil Dewar has a notebook covering ten important things to know when migrating from R to SparkR:
- Apache Spark Building Blocks. A high-level overview of Spark describes what is available for the R user.
- SparkContext, SQLContext, and SparkSession. In Spark 1.x, SparkContext and SQLContext let you access Spark. In Spark 2.x, SparkSession becomes the single entry point (see the entry-point sketch after this list).
- A DataFrame or a data.frame? Spark’s distributed DataFrame is different from R’s local data.frame. Knowing the differences lets you avoid simple mistakes (a conversion sketch follows the list).
- Distributed Processing 101. Understanding the mechanics of Big Data processing helps you write efficient code and avoid blowing up your cluster’s master node (see the lazy-evaluation sketch below).
- Function Masking. Like many R packages, SparkR masks some functions from packages loaded before it (see the masking sketch below).
- Specifying Rows. With Big Data and Spark, you generally select rows in DataFrames by condition rather than by position, unlike local R data.frames (see the row-selection sketch below).
- Sampling. Sample data in the right way, and use it as a tool for converting between big and small data (see the sampling sketch below).
- Machine Learning. SparkR has a growing library of distributed machine-learning algorithms (see the model-fitting sketch below).
- Visualization. It can be hard to visualize big data, but there are tricks and tools that help (see the plotting sketch below).
- Understanding Error Messages. For R users, Spark error messages can be daunting. Knowing how to parse them helps you find the relevant parts.
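To make a few of these concrete, here are some minimal sketches of my own (not excerpts from the notebook), written against Spark 2.x with the built-in mtcars data standing in for real Big Data; names like df_spark, the local[*] master URL, and the app name are just placeholders. First, the entry points:

```r
library(SparkR)

# Spark 1.x: two separate entry points (shown commented out; superseded in 2.x)
# sc         <- sparkR.init(master = "local[*]", appName = "r-to-sparkr")
# sqlContext <- sparkRSQL.init(sc)

# Spark 2.x: a single SparkSession is the entry point
sparkR.session(master = "local[*]", appName = "r-to-sparkr")

# ... work with SparkDataFrames here ...

sparkR.session.stop()
```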
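Moving between a local data.frame and a distributed SparkDataFrame, sketched with createDataFrame() and collect():

```r
library(SparkR)
sparkR.session()

# Local R data.frame -> distributed SparkDataFrame
df_local <- mtcars
df_spark <- createDataFrame(df_local)

class(df_local)   # "data.frame"      (lives in R's memory on one machine)
class(df_spark)   # "SparkDataFrame"  (distributed across the cluster)

# Distributed -> local again; only safe when you know the result is small
df_back <- collect(df_spark)
```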
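On the distributed-processing side, one sketch of the habit that keeps the master node alive: transformations are lazy, actions run on the cluster, and collect() should only ever see something you have already made small:

```r
library(SparkR)
sparkR.session()

df_spark <- createDataFrame(mtcars)

# Transformations are lazy: this only builds a plan, nothing runs yet
efficient <- filter(df_spark, df_spark$mpg > 25)

# Actions trigger the distributed work on the executors
count(efficient)          # returns a single number to the driver

# Risky with real Big Data: collect() ships every row back to one machine
# everything <- collect(df_spark)

# Safer: reduce first (filter, aggregate, or sample), then collect the small result
small_result <- collect(efficient)
```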
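A sketch of masking in practice: after library(SparkR), names such as filter and sample resolve to SparkR’s methods for SparkDataFrames, and the :: prefix reaches the masked originals:

```r
library(SparkR)   # attaching SparkR masks e.g. filter (from stats) and sample (from base)
sparkR.session()

df_spark <- createDataFrame(mtcars)

# The short names now refer to SparkR's methods for SparkDataFrames
fast_cars   <- filter(df_spark, df_spark$mpg > 25)
tiny_sample <- sample(df_spark, withReplacement = FALSE, fraction = 0.1)

# Prefix with the package name to reach the masked originals
smoothed <- stats::filter(1:10, rep(1/3, 3))   # linear filtering on a local vector
picks    <- base::sample(1:10, 3)              # ordinary random sample
```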
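A sketch of the row-selection difference: positional indexing makes sense for a local data.frame, while rows of a SparkDataFrame are selected by condition:

```r
library(SparkR)
sparkR.session()

df_local <- mtcars
df_spark <- createDataFrame(mtcars)

# Local data.frame: rows by position or by logical condition
first_five <- df_local[1:5, ]
fast_local <- df_local[df_local$mpg > 25, ]

# SparkDataFrame: rows have no positional index across the cluster,
# so you select them by condition instead
fast_spark <- filter(df_spark, df_spark$mpg > 25)   # where() works the same way
head(fast_spark)                                    # peek at a few rows locally
```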
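Sampling as the bridge between big and small data, sketched with a seeded fraction that is collected only after it has been shrunk:

```r
library(SparkR)
sparkR.session()

df_spark <- createDataFrame(mtcars)   # stand-in for a genuinely large table

# Take a seeded ~20% sample on the cluster
sampled <- sample(df_spark, withReplacement = FALSE, fraction = 0.2, seed = 42)

# Only the sample crosses back to the driver, where normal R tools apply
df_small <- collect(sampled)
nrow(df_small)
```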
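For the machine-learning point, a sketch of fitting a generalized linear model with spark.glm (available from Spark 2.0); the formula syntax mirrors ordinary R modelling:

```r
library(SparkR)
sparkR.session()

df_spark <- createDataFrame(mtcars)

# Fit a Gaussian GLM on the cluster instead of in local R memory
model <- spark.glm(df_spark, mpg ~ wt + cyl, family = "gaussian")
summary(model)

# Predictions come back as another SparkDataFrame with a "prediction" column
preds <- predict(model, df_spark)
head(select(preds, "mpg", "prediction"))
```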
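And one of the visualization tricks, sketched here: aggregate on the cluster, collect the tiny summary, and plot it locally (assuming ggplot2 is installed):

```r
library(SparkR)
library(ggplot2)
sparkR.session()

df_spark <- createDataFrame(mtcars)

# Summarize on the cluster so only a handful of rows come back to the driver
by_cyl <- collect(agg(groupBy(df_spark, "cyl"), avg_mpg = avg(df_spark$mpg)))

# Plot the small, local summary with an ordinary R plotting package
ggplot(by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average mpg")
```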
I highly recommend checking out the notebook.