Press "Enter" to skip to content

Cloud Storage Archival via Parquet Files

Joey D’Antoni builds a tool:

What I’m writing about today has nothing to do with analytics, per se. It has everything to do with cloud storage, and the way operations there are priced. Specifically, metadata operations–in the demo code I’ve shared we’re going from five files to one, but you can imagine going from a much larger number of files to much smaller number of files. You may ask–“Joey that sounds dumb, why are you reinventing zip and iso files”. Well, the main reason is that many cloud operations are priced on the number of objects–for example if you had to calculate a checksum across a number of files on S3. (For files/objects that were created before S3 automatically did checksums).

Click through for more information on how it works, as well as a link to the GitHub repo.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.