Stephen Turner reads some data:
In these slides I also learned about the nanoparquet package — a zero dependency package for reading and writing parquet files in R. Besides all the benefits noted above, parquet is much faster to read and write. And, as opposed to saving as .rds, parquet can easily be passed back and forth between R, Python, and other frameworks.
Let’s take a look at how reading and writing parquet files compares with CSV, either with base R or readr.
Stephen shows one of the best-case scenarios for Parquet: lots of data (100 million rows), relatively few columns, no long strings, and so on. That leads to a massive improvement over CSV, even if you ignore the metadata and formatting benefits. I wouldn’t expect the benefits to be nearly as significant with wide text columns and few repeated values, since that leaves Parquet’s dictionary and run-length encoding little to work with, but that’s also pretty uncommon for the type of dataset we’re analyzing in R.
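If you want to try this kind of comparison on your own machine, here is a minimal sketch. It is not Stephen’s benchmark: the data frame, its column names, and the one-million-row size are my own placeholders, and it only times base R’s CSV functions against nanoparquet’s `write_parquet()` and `read_parquet()`, assuming the package is installed. Timings will vary with hardware and data shape.

```r
# Assumes nanoparquet is installed: install.packages("nanoparquet")
library(nanoparquet)

set.seed(42)
n <- 1e6  # illustrative size, far smaller than the 100 million rows in the post

# Made-up data: few columns, short strings, many repeated values
df <- data.frame(
  id    = seq_len(n),
  group = sample(c("a", "b", "c"), n, replace = TRUE),
  value = rnorm(n)
)

csv_path     <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")

# Time the writes
csv_write     <- system.time(write.csv(df, csv_path, row.names = FALSE))
parquet_write <- system.time(write_parquet(df, parquet_path))

# Time the reads
csv_read     <- system.time(df_csv <- read.csv(csv_path))
parquet_read <- system.time(df_parquet <- read_parquet(parquet_path))

# Elapsed seconds for each step
print(c(
  csv_write     = csv_write[["elapsed"]],
  parquet_write = parquet_write[["elapsed"]],
  csv_read      = csv_read[["elapsed"]],
  parquet_read  = parquet_read[["elapsed"]]
))

# File sizes on disk, in bytes
print(c(csv = file.size(csv_path), parquet = file.size(parquet_path)))
```

To test the less favorable case, swap `group` for a column of long, mostly unique strings and see how much of the gap remains.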