Press "Enter" to skip to content

Parquet Versus Avro

Matthew Rathbone compares the Parquet and Avro file formats:

JSON improves upon CSV as each row provides some indication of schema, but without a special header-row, there’s no way to derive a schema for every record in the file, and it isn’t always clear what type a ‘null’ value should be interpreted as.

Avro and Parquet on the other hand understand the schema of the data they store. When you write a file in these formats, you need to specify your schema. When you read the file back, it tells you the schema of the data stored within. This is super useful for a framework like Spark, which can use this information to give you a fully formed data-frame with minimal effort.

I was kind of hoping to see ORC in the comparison as well, though even when the Hortonworks-Cloudera competition was at its max, my recollection is that the differences between the two formats were pretty small (where ORC was a little faster for non-nested data and Parquet a little faster for nested data).