S3Guard

Mingliang Liu and Rajesh Balamohan explain why you shouldn’t use S3 as your primary Hadoop data store, as well as a tool which helps mitigate those problems:

Some of the real world use cases which can be impacted due to the S3 eventual consistency model are:

  1. Listing Files. Newly created files might not be visible for data processing. In Hive, Spark and MapReduce, this can lead to erroneous results from incomplete source data or failure to commit all intermediate results.

  2. ETL Workflow. Systems like Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.

  3. Existence-guarded path operations. Any action which fails if the destination path is present may see a deleted file in a listing, and so fail — even though the file has already been deleted.

Read on to see how S3Guard works and how to enable it in HDP 2.6.

Related Posts

Azure Data Lake Alerting

Jose Lara shows how to send alerts if you hit a utilization threshold: If you want to see the step-by-step guide to create a new Log Analytics alert, check out our recent blog post on creating Log Analytics Alerts. For the alert signal logic, use the following values: Use the query from the previous step Set […]

Read More

Running The Azure DTU Calculator On An Older Server

Jim Donahoe shows us how to get the Azure DTU calculator running on an older server without Powershell: I recently had to do an analysis of a client’s database workload using the Azure DTU Calculator(DTU Calculator) and thought it might be interesting to share just how I did that.  I have run this tool numerous […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930