Performance Testing Hadoop File Formats

Kevin Feasel

2017-02-14

Hadoop

Zbigniew Baranowski looks at the performance of several Hadoop file formats for various activities:

The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:

  • 2 x 8 cores @2.60GHz
  • 64GB of RAM
  • 2 x 24 SAS drives

Hadoop cluster was installed from Cloudera Data Hub(CDH) distribution version 5.7.0, this includes:

  • Hadoop core 2.6.0
  • Impala 2.5.0
  • Hive 1.1.0
  • HBase 1.2.0 (configured JVM heap size for region servers = 30GB)
  • (not from CDH) Kudu 1.0 (configured memory limit = 30GB)

Apache Impala (incubating) was used as a data ingestion and data access framework in all the conducted tests presented later in this report.

I would have liked to have seen ORC included as a file format for testing.  Regardless, I think this article shows that there are several file formats for a reason, and you should choose your file format based on most likely expected use.  For example, Avro or Parquet for “write-only” systems or Kudu for larger-scale analytics.

Related Posts

Streaming ETL In Practice Using KSQL

Robin Moffatt builds an example of streaming ETL using Oracle, GoldenGate, and Kafka: So in this post I’m going to show an example of what streaming ETL looks like in practice. I’m replacing batch extracts with event streams, and batch transformation with in-flight transformation of these event streams. We’ll take a stream of data from […]

Read More

Automating HDF Cluster Deployment

Ali Bajwa has a how-to guide for automating HDF 3.1 cluster deployment on AWS: The release of HDF 3.1 brings about a significant number of improvements in HDF: Apache Nifi 1.5, Kafka 1.0, plus the new NiFi registry. In addition, there were improvements to Storm, Streaming Analytics Manager, Schema Registry components. This article shows how you can […]

Read More

Categories

February 2017
MTWTFSS
« Jan Mar »
 12345
6789101112
13141516171819
20212223242526
2728