Spark Changes In HDP 2.6

Vinay Shukla and Syed Mahmood talk about what’s new with Spark and Zeppelin in the Hortonworks Data Platform 2.6 update:

SPARKR & PYSPARK

Most data scientists use R & Python, and with SparkR & PySpark, respectively, they can continue to leverage their familiarity with those languages. However, they still need to use the Spark API to apply machine learning with Spark and to take advantage of distributed computation. Both SparkR & PySpark are evolving rapidly, and SparkR now supports a number of machine learning algorithms such as LDA, ALS, RF, GMM, and GBT. Another key improvement in SparkR is the ability to deploy a package interactively. This will help data scientists deploy their favorite R packages in their own environment without stepping on other users.

PySpark now also supports VirtualEnv, which will allow PySpark users to deploy their libraries in their own individual environments.

There are several large changes, so check out the full post.
