S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of HDFS on local disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.
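
As a quick sanity check on the figures quoted above, an availability percentage can be converted into expected downtime per year. This is a minimal sketch, not anything from the original article: the function name downtime_hours_per_year is hypothetical, and it assumes a 365-day year with downtime spread evenly over it.

```python
# Assumption: a non-leap year of 365 days, i.e. 8760 hours.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into expected hours of downtime per year.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    # The two availability figures mentioned in the quoted passage.
    for label, pct in [("S3 (claimed)", 99.99), ("Hadoop cluster at 99.9%", 99.9)]:
        hours = downtime_hours_per_year(pct)
        print(f"{label}: {pct}% available, about {hours:.2f} hours down per year")
```

Under these assumptions, 99.9% availability works out to about 8.76 hours of downtime per year, which is where the "roughly 9 hours" figure comes from, while 99.99% works out to a little under one hour.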

It’s interesting how opinion has shifted; even a year ago, the recommendation would have been different.
