More On S3Guard

Aaron Fabbri describes how S3Guard works:

Although Apache Hadoop has support for using Amazon Simple Storage Service (S3) as a Hadoop filesystem, S3 behaves different than HDFS.  One of the key differences is in the level of consistency provided by the underlying filesystem.  Unlike HDFS, S3 is an eventually consistent filesystem.  This means that changes made to files on S3 may not be visible for some period of time.

Many Hadoop components, however, depend on HDFS consistency for correctness. While S3 usually appears to “work” with Hadoop, there are a number of failures that do sometimes occur due to inconsistency:

  • FileNotFoundExceptions. Processes that write data to a directory and then list that directory may fail when the data they wrote is not visible in the listing.  This is a big problem with Spark, for example.

  • Flaky test runs that “usually” work. For example, our root directory integration tests for Hadoop’s S3A connector occasionally fail due to eventual consistency. This is due to assertions about the directory contents failing. These failures occur more frequently when we run tests in parallel, increasing stress on the S3 service and making delayed visibility more common.

  • Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.

Worth reading if you’re looking at using S3 to store data for Hadoop.  Also check out an earlier post on the topic.

Related Posts

Anomaly Detection With Kafka Streams

Ajmal Karuthakantakath shows us an application which performs fairly simple anomaly detection using Kafka Streams: The problem is in the banking loan payment domain, where customers have taken a loan and they need to make monthly payments to repay the loan amount. Assume there are millions of customers in the system and all these customers need […]

Read More

Master Data In Azure

Matt How explains why Master Data Services isn’t a great cloud-based master data management solution and offers up an alternative: Excel is easy to use, but not user friendly Excel is on nearly every desktop in any Windows based organisation and with the Master Data Services Add-in, it puts the data well within the reach […]

Read More

Categories

August 2017
MTWTFSS
« Jul Sep »
 123456
78910111213
14151617181920
21222324252627
28293031