Hadoop’s S3 Support

Steve Loughran and Sanjay Radia give us a history lesson on Hadoop’s support for Amazon S3:

Hadoop’s ability to work with Amazon S3 storage goes back to 2006 and the issue HADOOP-574, “FileSystem implementation for Amazon S3”. This filesystem client, “s3://” implemented an inode-style filesystem atop S3: it could support bigger files than S3 could then support, some its operations (directory rename and delete) were fast. The s3 filesystem allowed Hadoop to be run in Amazon’s EMR infrastructure, using S3 as the persistent store of work. This piece of open source code predated Amazon’s release of EMR, “Elastic MapReduce” by over two years. It’s also notable as the piece of work which gained Tom White, author of “Hadoop, the Definitive Guide”, committer status.

It’s interesting to see how this project has matured over the past decade.

Related Posts

Combining Stream Analytics And Azure ML With Power BI

Brad Llewellyn shows us how to feed Azure ML predictions into Power BI via Azure Stream Analytics: Today, we’re going to talk about combining Stream Analytics with Azure Machine Learning Studio within Power BI.  If you haven’t read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query […]

Read More

Working With Skewed Data In Pig

Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data: The problem is that there are 3,000 map tasks are launched to read the daily data and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s […]

Read More

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31