PySpark Persistence

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 with Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its own section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefore, my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt.

It’s pretty easy to do, and the Scala code would look suspiciously similar.  The Java version of the code would be seven pages long.
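
As a rough illustration of the idea in the quote, here is a minimal sketch of writing text out from PySpark to a cluster-relative path. It is not David's code: the helper names `cluster_relative_path` and `save_lines` are hypothetical, and it assumes that a path beginning with "/" resolves against the cluster's default storage root (dasciencecluster in the quote). It uses only the standard `SparkContext.parallelize` and `RDD.saveAsTextFile` calls.

```python
import posixpath


def cluster_relative_path(*parts):
    """Hypothetical helper: join path parts into a path rooted at the
    cluster's default filesystem, e.g. /data/open_data/RF_Model.txt.
    Assumption: the cluster maps "/" to its own folder in the store."""
    return "/" + posixpath.join(*parts)


def save_lines(sc, lines, path):
    """Hypothetical helper: write an iterable of strings to `path`.

    `sc` is expected to be a pyspark.SparkContext. Note that
    saveAsTextFile creates a directory of part files at `path`
    (one per partition) and raises an error if `path` already exists.
    """
    # A single partition keeps the output to one part file.
    rdd = sc.parallelize(list(lines), numSlices=1)
    rdd.saveAsTextFile(path)
    return path


# Example usage on a cluster (commented out because it needs a live Spark
# installation):
#
# from pyspark import SparkContext
# sc = SparkContext.getOrCreate()
# target = cluster_relative_path("data", "open_data", "RF_Model.txt")
# save_lines(sc, ["line one of the model dump", "line two"], target)
```

Despite the .txt name, the result on disk is a folder containing part files, so downstream readers should point at the folder (for example with `sc.textFile`) rather than expect a single file.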


