S3 Or EBS?

Devadutta Ghat, et al, compare Amazon S3 versus Elastic Block Storage (EBS) on the basis of cost and Apache Impala performance:

EBS is attached to the AWS compute node as a fully-functional filesystem (similar to an attached SSD on an on-premise node), and Impala makes use of several filesystem features to deliver higher throughput and lower latency. These features include:

  • HDFS short-circuit reads to bypass HDFS and read files directly from the filesystem
  • OS buffer cache to read frequently accessed files directly from the cache instead of fetching it again
  • Fixed-cost file renames through metadata operations

In contrast, S3 is an object store that is accessed over the network. However, with S3, throughput is better than simple network-attached storage because of its dedicated, high-performance networks. In Cloudera’s internal benchmark testing (detailed below), on an r3.2xlarge, we saw a consistent throughput of about 100MB/s. Furthermore, in S3, there is currently no equivalent to HDFS short-circuit reads. Move/rename operations for data stored in S3 is a copy followed by a delete, while a file move on HDFS is a metadata operation—which is usually problematic for ETL workloads, as they create large number of small files that are typically moved.

It looks like EBS is a solid choice for many workloads.

Related Posts

Hortonworks Data Platform 3.0 Released

Saumitra Buragohain, et al, announce the newest version of the Hortonworks Data Platform: Highlighted Apache Hive features include: Workload management for LLAP:  You can assign resource pools within LLAP pool and allocate resources on a per user or per group basis. This enables support for large multi-tenant deployments. ACID v2 and ACID on by default:  We are […]

Read More

New Features In Public Preview On Azure SQL Database

Microsoft has a round of announcements for public previews on Azure SQL Database.  First up is Kevin Farlee announcing approximate count distinct: The new APPROX_COUNT_DISTINCT aggregate function returns the approximate number of unique non-null values in a group. This function is designed for use in big data scenarios and is optimized for the following conditions: Access of […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930