RStudio has announced an interface between R and Apache Spark, named sparklyr:
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:
Interactively manipulate Spark data using both dplyr and SQL (via DBI).
Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water.
Create extensions that call the full Spark API and provide interfaces to Spark packages.
Integrated support for establishing Spark connections and browsing Spark DataFrames within the RStudio IDE.
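To make the first two highlights concrete, here is a minimal sketch of what a session might look like. It is not taken from the announcement: it assumes sparklyr and dplyr are installed, that a local Spark installation is available (for example one set up with spark_install()), and it uses the built-in mtcars data frame and a table name of "mtcars" purely for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation exists (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small built-in data frame into Spark; the result is a remote table
# that dplyr can operate on. The table name "mtcars" is just an example.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# The dplyr verbs below are translated to Spark SQL and executed in Spark.
# collect() then brings the small aggregated result back into R.
mpg_by_cyl <- mtcars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

# The same connection can be used through DBI to run plain SQL.
heavy <- DBI::dbGetQuery(sc, "SELECT COUNT(*) AS n FROM mtcars WHERE wt > 3")

spark_disconnect(sc)
```

The filtering and aggregation happen in Spark, and only the summarised rows are pulled into R, which is the pattern the second highlight describes.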
So what’s the difference between sparklyr and SparkR?
@zedoring sparkR is “inspired by dplyr” and distributed with Spark, sparklyr is a proper dplyr back-end which will be on CRAN.
— Jeff Allen (@TrestleJeff) June 28, 2016
This might be the package I’ve been awaiting.