Taxi Cab Data On SQLite And Parquet

Mark Litwintschik loads 1.1 billion rows of New York City taxi data into a SQLite database, using data stored in Parquet-formatted files on HDFS:

The dataset used in this benchmark has 1.1 billion records, 51 columns and is 500 GB in size when in uncompressed CSV format. Instructions on producing the dataset can be found in my Billion Taxi Rides in Redshift blog post. The CSV files were converted into Parquet format using Hive and Snappy compression on an AWS EMR cluster. The conversion resulted in 56 Parquet files which take up 105 GB of space.

Where decompression is I/O or network bound it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.

I’ve downloaded the Parquet files to my local file system and imported them onto HDFS. Since this is all running on a single SSD drive I’ve set the HDFS replication factor to 1.

It’s not the fastest result I’ve seen from Mark’s work, but I was impressed that SQLite could take that abuse.
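
To put the sizes Mark quotes in perspective, here is a rough back-of-the-envelope sketch in Python. It only reworks the figures from the quote (1.1 billion rows, 500 GB of CSV, 105 GB of Parquet across 56 files); the `summarize` helper is a hypothetical name of mine, and treating a gigabyte as 10**9 bytes is an assumption, since the post does not say which unit is meant.

```python
# Figures taken from the quoted post. Decimal gigabytes (1 GB = 10**9 bytes)
# are an assumption; the source does not specify the unit.
ROWS = 1_100_000_000
CSV_GB = 500
PARQUET_GB = 105
PARQUET_FILES = 56
BYTES_PER_GB = 10**9


def summarize(rows, csv_gb, parquet_gb, files):
    """Return derived size metrics for the CSV and Parquet copies."""
    return {
        "compression_ratio": csv_gb / parquet_gb,
        "space_saved_pct": 100 * (1 - parquet_gb / csv_gb),
        "avg_file_gb": parquet_gb / files,
        "csv_bytes_per_row": csv_gb * BYTES_PER_GB / rows,
        "parquet_bytes_per_row": parquet_gb * BYTES_PER_GB / rows,
    }


stats = summarize(ROWS, CSV_GB, PARQUET_GB, PARQUET_FILES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the Parquet copy is a little under five times smaller than the CSV, with each file averaging just under 2 GB. That reduction likely reflects Parquet's columnar encoding as well as Snappy, so it should not be read as Snappy's compression ratio alone.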
