
Columnar File Formats in Hadoop

Matthew Rathbone gives us an overview of the benefits of the ORC and Parquet file formats:

People throw this term around a lot, but I don’t think it is always clear exactly what this means in practice.

The textbook definition is that columnar file formats store data by column, not by row. CSV, TSV, JSON, and Avro are traditional row-based file formats. Parquet and ORC are columnar file formats.

Read on for a comparison and example. In the SQL Server world, think columnstore versus rowstore indexes and you won’t be too far off.
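
To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It is not how Parquet or ORC actually lay out bytes on disk; the records, field names, and helper functions are all made up for illustration. It only shows the core idea: the same data can be grouped per record or per field.

```python
# Toy records; the field names and values are hypothetical.
records = [
    {"id": 1, "city": "Oslo", "temp_c": 4.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 31.2},
]


def to_row_layout(rows):
    """Row-based: all fields of one record are kept together."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Columnar: all values of one field are kept together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Averaging a single field: the row layout has to visit every record,
# while the columnar layout reads just the one list it needs.
avg_from_rows = sum(row[2] for row in row_layout) / len(row_layout)
avg_from_columns = sum(column_layout["temp_c"]) / len(column_layout["temp_c"])

print(row_layout)
print(column_layout)
```

The query on a single field is where the columnar shape pays off: only the `temp_c` values are touched, which is roughly the same intuition behind columnstore versus rowstore indexes.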