Today, we’re going to talk about the Databricks File System (DBFS) in Azure Databricks. If you haven’t read the previous posts in this series, Introduction, Cluster Creation and Notebooks, they may provide some useful context. You can find the files from this post in our GitHub Repository. Let’s move on to the core of this post, DBFS.
As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks, Clusters, Code and Data. For this post, we’re going to talk about the storage layer underneath Azure Databricks, DBFS. Since Azure Databricks manages Spark clusters, it requires an underlying Hadoop Distributed File System (HDFS). This is exactly what DBFS is. Basically, HDFS is the low cost, fault-tolerant, distributed file system that makes the entire Hadoop ecosystem work. We may dig deeper into HDFS in a later post. For now, you can read more about HDFS here and here.
Click through for more detail on DBFS.