Chen Hirsh wants to consolidate:
To write and read data faster, Spark splits the work between the nodes in a cluster, with each node reading or writing part of the data. That’s why, in the screenshot above, there are 3 CSV files (the files starting with “Part” and ending with a CSV extension) instead of 1. Note that this can also occur when working with a single-node cluster, since Spark splits the work into tasks.
This behavior is great if you intend to keep working with the CSV files in Databricks, since reading will be faster. But if you want to share the data as a single file with someone outside of Databricks, this may be inconvenient.
Read on for two ways of doing this, as well as the price you pay to get it done.
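As a rough illustration of the problem (and not necessarily either of the two approaches in the linked post), here is a minimal sketch that stitches the part files back together after the fact using only the Python standard library. The function name `merge_part_files` and its parameters are hypothetical, and the sketch assumes every part file was written with a header row and has been copied to a locally accessible directory.

```python
import csv
import glob
import os


def merge_part_files(part_dir, output_path):
    """Concatenate part-*.csv files in part_dir into one CSV with a single header.

    Assumption: each part file starts with the same header row.
    Returns the number of data rows written (header excluded).
    """
    # Sort so the parts are appended in a stable, predictable order.
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*.csv")))
    rows_written = 0
    header_written = False

    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part_path in part_paths:
            with open(part_path, newline="", encoding="utf-8") as part_file:
                reader = csv.reader(part_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip completely empty part files
                if not header_written:
                    writer.writerow(header)  # keep only the first header
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Inside Spark itself, a commonly used alternative is to call `coalesce(1)` on the DataFrame before writing, so that a single task produces a single part file. The trade-off is that the write loses its parallelism, which can be slow for large datasets.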