Comparing File Formats In Hadoop

Andrew Peterson points out performance comparisons for various Hadoop file formats:

According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity.  (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

There are several considerations around picking the correct file format, and it’s probably best to experiment with them in your specific environment.

Related Posts

Capturing Implicit Conversions With Extended Events

Grant Fritchey shows how easy it is to build an extended event which captures implicit conversions: Built right into the Extended Events is an event that captures conversions that would affect execution plans, plan_affecting_convert. This event will show both CONVERT and CONVERT_IMPLICIT warnings that you would normally only see within an execution plan. You can […]

Read More

Security Improvements In Kafka And Confluent Platform

Vahid Fereydouny demonstrates a number of security improvements made to Apache Kafka 2.0 as well as Confluent Platform 5.0: Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature […]

Read More

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31