Comparing File Formats In Hadoop

Andrew Peterson points out performance comparisons for various Hadoop file formats:

According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity.  (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

There are several considerations around picking the correct file format, and it’s probably best to experiment with them in your specific environment.

Related Posts

Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka: When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message A before crashing […]

Read More

Reducing Reads In Queries

Bert Wagner has a few tips for improving query performance by reducing the number of reads: If SQL Server thinks it only is going to read 1 row of data, but instead needs to read way more rows of data, it might choose a poor execution plan which results in more reads. You might get a suboptimal […]

Read More

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31