Press "Enter" to skip to content

Fixing the Small File Problem in Hadoop

Guy Shilo takes us through the Hadoop Archive format:

It has a hard time handling many small files. The memory footprint of the namenodes becomes high, as they have to keep track of many small blocks, and the performance of scans goes down.

The best way to fix this situation is, of course, to avoid it in the first place. This can be done when designing the application or the pipeline that inserts the data into HDFS, for example by bundling many small files into one container such as a SequenceFile, an Avro file, or a Hadoop archive (.har file).
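As a rough illustration of the bundling approach, here is a minimal sketch that packs a local directory of small files into a single SequenceFile on HDFS, one record per file, keyed by file name. The source directory and destination path are hypothetical; this is one way to do it, not the method from the linked article:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileBundler {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Destination SequenceFile on HDFS (path is hypothetical)
        Path out = new Path("hdfs:///data/bundled/small-files.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                 SequenceFile.Writer.file(out),
                 SequenceFile.Writer.keyClass(Text.class),
                 SequenceFile.Writer.valueClass(BytesWritable.class));
             DirectoryStream<java.nio.file.Path> dir =
                 Files.newDirectoryStream(Paths.get("/tmp/small-files"))) {
            // One record per small file: key = file name, value = raw bytes
            for (java.nio.file.Path p : dir) {
                byte[] bytes = Files.readAllBytes(p);
                writer.append(new Text(p.getFileName().toString()),
                              new BytesWritable(bytes));
            }
        }
    }
}
```

The point of the design is that the namenode now tracks one file and its blocks instead of thousands of tiny ones; the original file names survive as record keys.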

The Hadoop archive is a somewhat overlooked option that I want to demonstrate today. You will see that it can be very useful in some cases but not so great in others.
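Archives themselves are built with the `hadoop archive` command-line tool; once created, files inside one can be read back through the `har://` filesystem scheme like any other HDFS path. A minimal read sketch, with hypothetical archive and file names:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // har:// paths address files *inside* an archive;
        // the archive and file names here are hypothetical
        Path inHar = new Path("har:///user/data/logs.har/2019/01/part-0001.log");
        FileSystem fs = inHar.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(inHar)) {
            IOUtils.copyBytes(in, System.out, conf, false);
        }
    }
}
```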

Read the whole thing before giving it a try, as there are some downsides.