Konstantin Shvachko et al. explain some of the changes to the Hadoop Distributed File System needed to scale to one exabyte of data:
LinkedIn runs its big data analytics on Hadoop. During the last five years, the analytics infrastructure has experienced tremendous growth, almost doubling every year in data size, compute workloads, and all other dimensions. It recently reached two important milestones.
1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters.
2. Our largest 10,000-node cluster stores 500 PB of data. It maintains 1 billion objects (directories, files, and blocks) on a single NameNode serving RPCs with an average latency under 10 milliseconds, making it one of the largest (if not the largest) Hadoop cluster in the industry.
From the early days of LinkedIn, Apache Hadoop has been the basis of our analytics infrastructure. Many teams have contributed to the effort of making Hadoop our canonical big data platform.
Read on for the techniques they've used, as well as the code changes implemented in HDFS to support this scale.
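As a point of reference, the namespace and RPC figures quoted above are the kind of numbers the NameNode exposes through its standard JMX metrics servlet. Below is a minimal sketch of pulling them yourself, assuming a Hadoop 3 NameNode with its web UI on the default port 9870 and its client RPC server on port 8020; the hostname namenode.example.com is a placeholder.

```python
"""Quick look at NameNode namespace and RPC metrics via the /jmx servlet.

Assumptions (adjust for your cluster): web UI on port 9870 (Hadoop 3
default) and RPC server on port 8020, so the RPC metrics bean is named
RpcActivityForPort8020.
"""
import json
import urllib.request

NAMENODE = "http://namenode.example.com:9870"  # placeholder host


def jmx_bean(query: str) -> dict:
    """Fetch a single JMX bean as a dict from the NameNode's /jmx servlet."""
    url = f"{NAMENODE}/jmx?qry={query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["beans"][0]


# Namespace object counts: FilesTotal covers files and directories,
# BlocksTotal covers blocks, so their sum matches the "objects" figure.
fs = jmx_bean("Hadoop:service=NameNode,name=FSNamesystem")
files_and_dirs = fs["FilesTotal"]
blocks = fs["BlocksTotal"]
print(f"namespace objects: {files_and_dirs + blocks:,} "
      f"({files_and_dirs:,} files/dirs + {blocks:,} blocks)")

# Average RPC processing latency in milliseconds for the client RPC port.
rpc = jmx_bean("Hadoop:service=NameNode,name=RpcActivityForPort8020")
print(f"avg RPC processing time: {rpc['RpcProcessingTimeAvgTime']:.2f} ms")
```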