You have many choices when it comes to storing and processing data on Hadoop, which can be both a blessing and a curse. The data may arrive in your Hadoop cluster in a human readable format like JSON or XML, or as a CSV file, but that doesn’t mean that’s the best way to actually store data.
In fact, storing data in Hadoop using those raw formats is terribly inefficient. Plus, those file formats cannot be stored in a parallel manner. Since you’re using Hadoop in the first place, it’s likely that storage efficiency and parallelism are high on the list of priorities, which means you need something else.
Luckily for you, the big data community has basically settled on three optimized file formats for use in Hadoop clusters: Optimized Row Columnar (ORC), Avro, and Parquet. While these file formats share some similarities, each of them are unique and bring their own relative advantages and disadvantages.
Read the whole thing. I’m partial to ORC and Avro but won’t blink if someone recommends Parquet.