Choosing A Hadoop Data Format

Silvia Oliveros has a set of considerations to help you choose a file format for your data in Hadoop:

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.

At first, I’d suggest just using delimited files, as it’s easiest that way.  Once you have developed a bit of Hadoop maturity, it is worth thinking about whether a rowstore format (like Avro) or a columnstore format (like Parquet or ORC) is the better fit for a particular data set.
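
To make the rowstore versus columnstore distinction a bit more concrete, here is a minimal sketch in plain Python. It does not use any Hadoop library or real file format; the sample data, the column names, and the helper functions are all made up for illustration. It only shows why a layout that keeps each column's values together can help when a query needs just one field.

```python
import csv
import io

# Hypothetical sample data, standing in for a small delimited file.
DELIMITED = "id,city,amount\n1,Lisbon,10.5\n2,Porto,7.25\n3,Lisbon,3.0\n"


def read_rows(text):
    """Row-oriented layout: all fields of a record are kept together."""
    return list(csv.DictReader(io.StringIO(text)))


def to_columns(rows):
    """Column-oriented layout: all values of a column are kept together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = read_rows(DELIMITED)
columns = to_columns(rows)

# Row layout: every whole record is visited just to sum one field.
total_from_rows = sum(float(row["amount"]) for row in rows)

# Column layout: only the "amount" values are visited.
total_from_columns = sum(float(value) for value in columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals agree; the difference is how much data has to be touched to get there. Real columnar formats build on the same idea, which is why they tend to suit the query-heavy final stage of a pipeline, while row-oriented files are a natural fit when whole records move from one step to the next.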


