Kevin Feasel


R, Spark

RStudio has announced an interface between R and Apache Spark, named sparklyr:

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

  • Interactively manipulate Spark data using both dplyr and SQL (via DBI).

  • Filter and aggregate Spark datasets then bring them into R for analysis and visualization.

  • Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water.

  • Create extensions that call the full Spark API and provide interfaces to Spark packages.

  • Integrated support for establishing Spark connections and browsing Spark DataFrames within the RStudio IDE.

So what’s the difference between sparklyr and SparkR?

This might be the package I’ve been awaiting.


October 2016