John Mount discusses the difficulty of using dplyr to count rows in Spark:
That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows. There are a number of common legitimate uses of nrow() in user code and package code, including:
Checking if a table is empty.
Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
Confirming data size is the same as reported in other sources (Spark, database, and so on).
Reporting the amount of work performed or rows-per-second processed.
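To make the issue concrete, here is a small R sketch of what Mount is describing and the usual workarounds, assuming a local Spark connection via sparklyr (the nycflights13 data set is just an illustrative choice; sdf_nrow() is sparklyr's own row counter):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# On a remote (lazy) table, nrow() deliberately returns NA
# rather than forcing a potentially expensive count:
nrow(flights_tbl)

# Workaround 1: sparklyr's dedicated row-count helper.
sdf_nrow(flights_tbl)

# Workaround 2: count remotely with dplyr, then pull the result back.
flights_tbl %>%
  summarise(n = n()) %>%
  pull(n)
```

The dplyr variant pushes the counting into Spark and only transfers a single number back to R, which is the pattern Mount's legitimate uses (emptiness checks, size comparisons, work reporting) all need.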
Read the whole thing; this seems unnecessarily complicated.