Performance Testing Hadoop File Formats

Zbigniew Baranowski looks at the performance of several Hadoop file formats for various activities:

The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:

2 x 8 cores @2.60GHz

64GB of RAM

2 x 24 SAS drives

The Hadoop cluster was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes:

Hadoop core 2.6.0

Impala 2.5.0

Hive 1.1.0

HBase 1.2.0 (configured JVM heap size for region servers = 30GB)

(not from CDH) Kudu 1.0 (configured memory limit = 30GB)

Apache Impala (incubating) was used as a data ingestion and data access framework in all the conducted tests presented later in this report.

I would have liked to see ORC included among the file formats tested. Regardless, I think this article shows that several file formats exist for a reason, and you should choose yours based on your most likely expected use: for example, Avro or Parquet for “write-only” systems, or Kudu for larger-scale analytics.
