The Performance of Various Tidy Wrappers

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.table, arrow, polars and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example is a dialect of SQL, an entirely different language so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.