Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?
In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.
Not listed in here is one word that I expected: Delta. Because that’s how we normally do incremental data modification in Parquet data. Either that or Apache Iceberg. Lee shows us a different route that can work.
Leave a Comment