PySpark Persistence

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 w/Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its cluster section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefor my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

It’s pretty easy to do, and the Scala code would look suspiciously similar.  The Java version of the code would be seven pages long.

Related Posts

Picking A Python IDE

Kevin Jacobs reviews a few Python IDEs from the perspective of a data scientist: Ladies and gentlemens, this is one of the most perfect IDEs for editing your Python code! At least in my opinion. Jupyter notebook is a web based code editor and can quickly generate visualizations. You can mix up code and text […]

Read More

Connect(); Announcements, Including Azure Databricks

James Serra has a wrapup of Microsoft Connect(); announcements around the data platform space: Microsoft Connect(); is a developer event from Nov 15-17, where plenty of announcements are made.  Here is a summary of the data platform related announcements: Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. […]

Read More


April 2017
« Mar May »