PySpark Persistence

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 w/Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its cluster section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefor my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

It’s pretty easy to do, and the Scala code would look suspiciously similar.  The Java version of the code would be seven pages long.

Related Posts

Comparing Keras In Python Versus R

Dmitry Kisler performs image classification using Keras in both Python and R: From the plots above, one can see that: the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is […]

Read More

Auto-Encoders And KernelML

Rohan Kotwani gives us an example where KernelML might be better than TensorFlow or PyTorch: So what’s the point of using KernelML? 1. The parameters in each layer can be non-linear 2. Each parameter can be sampled from a different random distribution 3. The parameters can be transformed to meet certain constraints 4. Network combinations […]

Read More


April 2017
« Mar May »