The Rtask people tell a story:
You have inherited (or written) a data pipeline originally coded in SAS. It processes administrative billing records: matching line items against reference tables, applying time-varying coefficients, deduplicating based on business identifiers, computing running counters. Classic ETL work.
The migration to R goes well. You use {DBI} to open a DuckDB connection, load your source files as lazy tables via {arrow} or dplyr::tbl(), build the transformations with {dbplyr}, and collect the result at the very end. Your code is readable, your tests compare the R output to the SAS reference, and they pass (maybe using {datadiff}).
Then you run the pipeline again.
The numbers are different.
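Give yourself 100 points if you answered “Because you need an ORDER BY clause” during the explanation. Without one, a SQL engine makes no promise about row order, so order-dependent steps such as deduplication and running counters can return different results from one run to the next. The authors also cover a few other places where working with DuckDB from R can cause issues. Most of this is straightforward for data platform people, but it can cause consternation for developers. H/T R-Bloggers.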
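To make the ORDER BY point concrete, here is a minimal sketch of the deduplication case. It is not the Rtask authors' code: it uses Python's built-in sqlite3 module as a stand-in for R and DuckDB so that it runs with no extra packages, and the billing table, its columns, and the dedupe() helper are all hypothetical. The idea carries over to any SQL engine: a window function that picks "the first row" per business identifier is only reproducible when its ORDER BY fully breaks ties.

```python
import random
import sqlite3

# Hypothetical billing lines: (claim_id, line_no, amount).
# Claims 1 and 2 each appear twice, so deduplication has to pick a winner.
ROWS = [
    (1, 1, 100.0),
    (1, 2, 120.0),
    (2, 1, 50.0),
    (2, 2, 75.0),
    (3, 1, 10.0),
]


def dedupe(rows, order_by):
    """Keep one row per claim_id, chosen by ROW_NUMBER() over `order_by`.

    `order_by` is spliced into the SQL text, so pass only trusted column
    names. Window functions need SQLite 3.25 or newer.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE billing (claim_id INTEGER, line_no INTEGER, amount REAL)"
    )
    con.executemany("INSERT INTO billing VALUES (?, ?, ?)", rows)
    query = f"""
        SELECT claim_id, line_no, amount
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY claim_id
                       ORDER BY {order_by}
                   ) AS rn
            FROM billing
        )
        WHERE rn = 1
        ORDER BY claim_id  -- the final result needs its own ORDER BY too
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    # Load the same rows in a different physical order.
    shuffled = ROWS[:]
    random.Random(0).shuffle(shuffled)

    # line_no is unique within each claim, so the winner never changes.
    print(dedupe(ROWS, "line_no") == dedupe(shuffled, "line_no"))
```

Passing a key that does not break ties, for example dedupe(ROWS, "claim_id"), leaves the choice of surviving row up to the engine. That is the situation the story describes: the result can look stable through many test runs and then change.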