Taking a Billion Taxi Rides with DuckDB

Mark Litwintschik tries out DuckDB:

DuckDB is an in-process database. Rather than relying on a server of its own, it’s used as a client. The client can work with data in memory, within DuckDB’s internal file format, database servers from other software developers and cloud storage services such as AWS S3.

This choice to not centralise DuckDB’s data within its own server, paired with being distributed as a single binary, makes installing and working with DuckDB much less complex than, say, standing up a Hadoop cluster.

The project isn’t aimed at very large datasets. Despite this, its ergonomics are enticing enough and it does so much to reduce engineering time that workarounds are worth considering. The rising popularity of analysis-ready, cloud-optimised Parquet files is removing the need for substantial hardware when dealing with datasets in the 100s of GBs or larger.
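
To get a feel for what "in-process" means in practice, here is a minimal sketch. It is not taken from Mark's post, and it uses Python's built-in sqlite3 module as a stand-in, since SQLite is also an in-process database and ships with Python. The `rides` table and its rows are made up for illustration. DuckDB's Python package should follow a similar connect-and-execute pattern, but treat that as an assumption and check its documentation.

```python
import sqlite3

# Stand-in for the in-process model: there is no server to start or connect to.
# The database engine runs inside this Python process.
con = sqlite3.connect(":memory:")  # data lives in memory only

# Hypothetical sample data, invented for this sketch
con.execute("CREATE TABLE rides (cab_type TEXT, fare REAL)")
con.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("yellow", 12.5), ("green", 8.0), ("yellow", 20.0)],
)

# A small aggregate query, run entirely in-process
rows = con.execute(
    "SELECT cab_type, COUNT(*), AVG(fare) "
    "FROM rides GROUP BY cab_type ORDER BY cab_type"
).fetchall()
print(rows)

con.close()
```

Swapping `":memory:"` for a file path would persist the data to a single local file, which is the other half of the appeal: still no server, just a file.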

Read on to learn more about DuckDB, how it differs from SQLite, and a bit of nuttiness around how far you can push an in-memory database.