More On S3Guard

Aaron Fabbri describes how S3Guard works:

Although Apache Hadoop has support for using Amazon Simple Storage Service (S3) as a Hadoop filesystem, S3 behaves different than HDFS.  One of the key differences is in the level of consistency provided by the underlying filesystem.  Unlike HDFS, S3 is an eventually consistent filesystem.  This means that changes made to files on S3 may not be visible for some period of time.

Many Hadoop components, however, depend on HDFS consistency for correctness. While S3 usually appears to “work” with Hadoop, there are a number of failures that do sometimes occur due to inconsistency:

  • FileNotFoundExceptions. Processes that write data to a directory and then list that directory may fail when the data they wrote is not visible in the listing.  This is a big problem with Spark, for example.

  • Flaky test runs that “usually” work. For example, our root directory integration tests for Hadoop’s S3A connector occasionally fail due to eventual consistency. This is due to assertions about the directory contents failing. These failures occur more frequently when we run tests in parallel, increasing stress on the S3 service and making delayed visibility more common.

  • Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.

Worth reading if you’re looking at using S3 to store data for Hadoop.  Also check out an earlier post on the topic.

Related Posts

Event Sourcing On Kafka

Adam Warski shows how you can use Apache Kafka as your event sourcing data source: There’s a number of great introductory articles, so this is going to be a very brief introduction. With event sourcing, instead of storing the “current” state of the entities that are used in our system, we store a stream of events that relate to these […]

Read More

The Basics Of Kafka Security

Stephane Maarek has a nice post covering some of the basics of securing an Apache Kafka cluster: Once your Kafka clients are authenticated, Kafka needs to be able to decide what they can and cannot do. This is where Authorization comes in, controlled by Access Control Lists (ACL). ACL are what you expect them to be: […]

Read More


August 2017
« Jul Sep »