John Mount discusses the difficulty of using dplyr to count rows in Spark:
That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting `nrow()` to return the number of rows. There are a number of common legitimate uses of `nrow()` in user code and package code, including:
- Checking if a table is empty.
- Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
- Confirming data size is the same as reported in other sources (Spark, database, and so on).
- Reporting the amount of work performed or rows-per-second processed.
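To see the behavior in question, here is a minimal sketch (assuming a local Spark installation, the sparklyr package, and the nycflights13 data, none of which are named in the post): `nrow()` on a remote Spark table returns `NA`, and the count has to be requested some other way.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")           # assumes a local Spark install
flights <- copy_to(sc, nycflights13::flights)   # any remote table works here

nrow(flights)                    # returns NA: the backend declines to count

sdf_nrow(flights)                # sparklyr's explicit row-count helper
flights %>% tally() %>% pull(n)  # or force the count through dplyr verbs

spark_disconnect(sc)
```

The `sdf_nrow()` and `tally()` calls both trigger a Spark job, which is presumably why the backend avoids doing so implicitly inside `nrow()`.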
Read the whole thing; this seems unnecessarily complicated.