Performance Testing Hadoop File Formats

Kevin Feasel

2017-02-14

Hadoop

Zbigniew Baranowski looks at the performance of several Hadoop file formats for various activities:

The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:

  • 2 x 8 cores @2.60GHz
  • 64GB of RAM
  • 2 x 24 SAS drives

Hadoop cluster was installed from Cloudera Data Hub(CDH) distribution version 5.7.0, this includes:

  • Hadoop core 2.6.0
  • Impala 2.5.0
  • Hive 1.1.0
  • HBase 1.2.0 (configured JVM heap size for region servers = 30GB)
  • (not from CDH) Kudu 1.0 (configured memory limit = 30GB)

Apache Impala (incubating) was used as a data ingestion and data access framework in all the conducted tests presented later in this report.

I would have liked to have seen ORC included as a file format for testing.  Regardless, I think this article shows that there are several file formats for a reason, and you should choose your file format based on most likely expected use.  For example, Avro or Parquet for “write-only” systems or Kudu for larger-scale analytics.

Related Posts

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow: Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info. Databricks Runtime 5.3 and 5.3 ML and above support […]

Read More

TensorFrames: Spark Plus TensorFlow

Adi Polak gives us an introduction to TensorFrames: In all TensorFrames functionality, the DataFrame is sent together with the computations graph. The DataFrame represents the distributed data, meaning in every machine there is a chunk of the data that will go through the graph operations/ transformations. This will happen in every machine with the relevant […]

Read More

Categories

February 2017
MTWTFSS
« Jan Mar »
 12345
6789101112
13141516171819
20212223242526
2728