Zbigniew Baranowski looks at the performance of several Hadoop file formats for various activities:
The data access and ingestion tests were run on a cluster composed of 14 physical machines, each equipped with:
- 2 x 8 cores @2.60GHz
- 64GB of RAM
- 2 x 24 SAS drives
The Hadoop cluster was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes:
- Hadoop core 2.6.0
- Impala 2.5.0
- Hive 1.1.0
- HBase 1.2.0 (configured JVM heap size for region servers = 30GB)
- (not from CDH) Kudu 1.0 (configured memory limit = 30GB)
Apache Impala (incubating) was used as the data ingestion and data access framework in all of the tests presented later in this report.
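To make that setup concrete, here is a minimal sketch (my own, not from the article) of driving Impala as both the ingestion and access layer, using the impyla Python client. The host name and table names are hypothetical, and the article does not specify its exact DDL or queries.

```python
from impala.dbapi import connect

# Connect to an Impala daemon; 21050 is the default HiveServer2-compatible port.
conn = connect(host='impalad.example.com', port=21050)
cur = conn.cursor()

# Ingestion: materialize staged rows into a Parquet-backed table (hypothetical names).
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_parquet
    STORED AS PARQUET
    AS SELECT * FROM events_staging
""")

# Access: a simple aggregate scan against the same table.
cur.execute("""
    SELECT event_type, COUNT(*) AS cnt
    FROM events_parquet
    GROUP BY event_type
""")
for event_type, cnt in cur.fetchall():
    print(event_type, cnt)

cur.close()
conn.close()
```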
I would have liked to see ORC included as a file format for testing. Regardless, I think this article shows that there are several file formats for a reason, and that you should choose your file format based on your most likely usage pattern: for example, Avro or Parquet for "write-only" systems, or Kudu for larger-scale analytics.
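As a rough illustration of how that choice surfaces in practice (again my own sketch, not the article's code), the DDL below creates an Avro table for an append-only log sink and a Kudu table for an analytics-facing copy. Table and column names are invented, and note that the `STORED AS KUDU` syntax assumes a newer Impala/Kudu pairing than the CDH 5.7.0 stack listed above, which used a storage-handler `TBLPROPERTIES` clause instead.

```python
from impala.dbapi import connect

conn = connect(host='impalad.example.com', port=21050)
cur = conn.cursor()

# "Write-only" log sink: row-oriented Avro keeps appends cheap.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_logs (
        ts BIGINT,
        host STRING,
        message STRING
    )
    STORED AS AVRO
""")

# Analytics-facing copy: Kudu supports fast scans plus updates/upserts.
cur.execute("""
    CREATE TABLE IF NOT EXISTS logs_analytics (
        ts BIGINT,
        host STRING,
        message STRING,
        PRIMARY KEY (ts, host)
    )
    PARTITION BY HASH (host) PARTITIONS 8
    STORED AS KUDU
""")

cur.close()
conn.close()
```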