S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.

It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.

Related Posts

Kafka Offset Management With Spark Streaming

Guru Medasana and Jordan Hambleton explain how to perform Kafka offset management when using Spark Streaming: Enabling Spark Streaming’s checkpoint is the simplest method for storing offsets, as it is readily available within Spark’s framework. Streaming checkpoints are purposely designed to save the state of the application, in our case to HDFS, so that it […]

Read More

Updates In Apache Kafka

Yeva Byzek announces that Apache Kafka 0.11.0.0 is shipping soon: We are very excited for the GA for Kafka release 0.11.0.0 which is just days away. This release is bringing many new features as described in the previous Log Compaction blog post. The most notable new feature is Exactly Once Semantics (EOS).  Kafka’s EOS capabilities […]

Read More

Categories

June 2017
MTWTFSS
« May  
 1234
567891011
12131415161718
19202122232425
2627282930