The dataset used in this benchmark contains 1.1 billion records across 51 columns and is 500 GB in size in uncompressed CSV format. Instructions for producing the dataset can be found in my Billion Taxi Rides in Redshift blog post. The CSV files were converted into Parquet format using Hive with Snappy compression on an AWS EMR cluster. The conversion resulted in 56 Parquet files which take up 105 GB of space.
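For reference, the conversion can be driven by a short HiveQL script along these lines. This is a minimal sketch rather than the exact job used here: the table names, S3 paths and the truncated column list are placeholders, and the real table declares all 51 columns.

```bash
# Sketch of the CSV-to-Parquet conversion, run from the EMR master node.
# Table names, S3 paths and the abbreviated column list are placeholders.
cat > convert_to_parquet.sql <<'EOF'
CREATE EXTERNAL TABLE trips_csv (
    trip_id          INT,
    vendor_id        STRING,
    pickup_datetime  TIMESTAMP
    -- ... remaining columns omitted for brevity
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/csv/';

CREATE EXTERNAL TABLE trips_parquet (
    trip_id          INT,
    vendor_id        STRING,
    pickup_datetime  TIMESTAMP
    -- ... remaining columns omitted for brevity
)
STORED AS parquet
LOCATION 's3://example-bucket/parquet/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');

INSERT OVERWRITE TABLE trips_parquet
SELECT * FROM trips_csv;
EOF

hive -f convert_to_parquet.sql
```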
Where reading the data is I/O- or network-bound, it makes sense to keep the compressed data as compact as possible. That said, there are cases where decompression is compute-bound, and lightweight compression schemes like Snappy play a useful role in lowering that overhead.
I’ve downloaded the Parquet files to my local file system and imported them into HDFS. Since this is all running off a single SSD drive, I’ve set the HDFS replication factor to 1.
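The copy itself can be done with the HDFS CLI; the sketch below assumes the Parquet files sit in ~/trips_parquet/ locally, and both the local and HDFS paths are placeholders.

```bash
# Replication factor of 1 for this single-disk setup, set in hdfs-site.xml:
#   <property><name>dfs.replication</name><value>1</value></property>

# Create a target directory on HDFS and copy the 56 Parquet files in.
hdfs dfs -mkdir -p /trips_parquet
hdfs dfs -copyFromLocal ~/trips_parquet/*.parquet /trips_parquet/

# Ensure any files written before the config change are also at replication 1.
hdfs dfs -setrep -w 1 /trips_parquet
```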