Spark Changes In HDP 2.6

Vinay Shukla and Syed Mahmood talk about what’s new with Spark and Zeppelin in the Hortonworks Data Platform 2.6 update:


Most data scientists use R & Python and with SparkR & PySpark respectively they can continue to leverage their familiarity with the R & Python languages. However, they need to use the Spark API to leverage Machine learning with Spark and to take advantage of distributed computations. Both SparkR & PySpark are evolving rapidly and SparkR now supports a number of machine learning algorithms such as LDA, ALS, RF, GMM GBT etc. Another key improvement in SparkR is the ability to deploy a package interactively. This will help Data Scientists deploy their favorite R package in their own environment without stepping on other users.

PySpark now also supports deploying VirtualEnv and this will allow PySpark users to deploy their libraries in their own individual deployments.

There are several large changes, so check it out.

Related Posts

R Services Internal Communication Mechanisms

Niels Berglund continues his R Services internals series: When browsing for the symbols, you can use this command: x /1 *!TCP*. By using the option /1 you’ll only see the names, and no addresses. On my machine that gives me quite a lot, but there are two entries that catch my eye: sqllang!Tcp::AcceptConnection and sqllang!Tcp::Close. So let us set breakpoints at […]

Read More

Compacting Shared Libraries In R

Dirk Eddelbuettel compacts the tidyverse: Of course, there is a third way: just run strip --strip-debug over all the shared libraries after the build. As the path is standardized, and the shell does proper globbing, we can just do $ strip --strip-debug /usr/local/lib/R/site-library/*/libs/*.so using a double-wildcard to get all packages (in that R package directory) and all their shared […]

Read More