Optimizing S3 For High Concurrency

Kevin Feasel



Aaron Friedman looks at how to optimize highly-concurrent, distributed workloads writing data to S3 buckets:

S3 is a massively scalable key-based object store that is well-suited for storing and retrieving large datasets. Due to its underlying infrastructure, S3 is excellent for retrieving objects with known keys. S3 maintains an index of object keys in each region and partitions the index based on the key name. For best performance, keys that are often read together should not have sequential prefixes. Keys should be distributed across many partitions rather than on the same partition.

For large datasets like genomics, population-level analyses of these data can require many concurrent S3 reads by many Spark executors. To maximize performance of high-concurrency operations on S3, we need to introduce randomness into each of the Parquet object keys to increase the likelihood that the keys are distributed across many partitions.

Reading the title, I wanted it to be an article on knobs to turn in S3 to maximize read performance.  It’s still an article well worth reading, but focuses from the other side:  how to write to S3 without stepping on your own toes.

Related Posts

Erasure Coding In Hadoop

Guy Shilo explains erasure coding, a new feature in Hadoop 3: The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and less blocks are written because there is no x3 replication). The larger the file the more notable is the performance gain. […]

Read More

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More


October 2016
« Sep Nov »