RStudio has announced an interface between R and Apache Spark, named sparklyr:
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:
- Interactively manipulate Spark data using both dplyr and SQL (via DBI).
- Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
- Orchestrate distributed machine learning from R using either Spark MLlib or H2O SparkingWater.
- Create extensions that call the full Spark API and provide interfaces to Spark packages.
- Integrated support for establishing Spark connections and browsing Spark DataFrames within the RStudio IDE.
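To give a flavour of the first two highlights, here is a minimal sketch of what a session might look like. It is not taken from the announcement: it assumes that sparklyr and dplyr are installed, that a local Spark installation is available (for example via `sparklyr::spark_install()`), and it uses the built-in `mtcars` data set and the table name `"mtcars"` purely for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation exists on this machine.
sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark; "mtcars" is an illustrative table name.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run inside Spark;
# collect() brings the (small) aggregated result back into R.
by_gear <- mtcars_tbl %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# The same connection can also be queried with plain SQL through DBI.
gear_counts <- DBI::dbGetQuery(
  sc,
  "SELECT gear, COUNT(*) AS n FROM mtcars GROUP BY gear"
)

spark_disconnect(sc)
```

The filtering and aggregation happen in Spark, so only the summary rows travel back to R, which is the "filter and aggregate, then bring into R" workflow the list describes.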
So what’s the difference between sparklyr and SparkR?
@zedoring sparkR is “inspired by dplyr” and distributed with Spark, sparklyr is a proper dplyr back-end which will be on CRAN.
— Jeff Allen (@TrestleJeff) June 28, 2016
This might be the package I’ve been awaiting.