Performance Testing Hadoop File Formats

Kevin Feasel

2017-02-14

Hadoop

Zbigniew Baranowski looks at the performance of several Hadoop file formats for various activities:

The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:

  • 2 x 8 cores @2.60GHz
  • 64GB of RAM
  • 2 x 24 SAS drives

The Hadoop cluster was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes:

  • Hadoop core 2.6.0
  • Impala 2.5.0
  • Hive 1.1.0
  • HBase 1.2.0 (configured JVM heap size for region servers = 30GB)
  • (not from CDH) Kudu 1.0 (configured memory limit = 30GB)

Apache Impala (incubating) was used as a data ingestion and data access framework in all the conducted tests presented later in this report.

I would have liked to see ORC included among the file formats tested. Regardless, I think this article shows that there are several file formats for a reason, and that you should choose one based on your most likely use. For example, consider Avro or Parquet for “write-only” systems, or Kudu for larger-scale analytics.
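To put the test environment in perspective, here is a rough back-of-the-envelope sketch that totals the per-machine figures quoted above. It rests on two assumptions of mine, not statements from the article: that "2 x 8 cores" means two 8-core sockets per machine, and that "2 x 24 SAS drives" means 48 drives per machine. The function and constant names are purely illustrative.

```python
# Per-machine figures as quoted in the excerpt above.
MACHINES = 14
SOCKETS_PER_MACHINE = 2      # assumption: "2 x 8 cores" means 2 sockets
CORES_PER_SOCKET = 8
RAM_GB_PER_MACHINE = 64
DRIVES_PER_MACHINE = 2 * 24  # assumption: "2 x 24 SAS drives" means 48 drives


def cluster_totals(machines: int = MACHINES) -> dict:
    """Return aggregate cores, RAM (GB), and drive count for the cluster."""
    return {
        "cores": machines * SOCKETS_PER_MACHINE * CORES_PER_SOCKET,
        "ram_gb": machines * RAM_GB_PER_MACHINE,
        "drives": machines * DRIVES_PER_MACHINE,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

Under those assumptions, the cluster comes to 224 cores, 896 GB of RAM, and 672 drives, so it is worth keeping that scale in mind before extrapolating the results to a smaller environment.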

