Counting Rows In Spark With Dplyr

Kevin Feasel



John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Spark, database, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.
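
For reference, here is a minimal sketch of one way to get a row count using only dplyr verbs rather than nrow(). It is not taken from John's post: count_rows is a hypothetical helper name, and the example runs against a local tibble so it needs nothing beyond dplyr. The assumption is that the same verbs would be translated to a COUNT(*) query when the input is a remote table such as one in Spark, but that part is untested here.

```r
library(dplyr)

# Hypothetical helper (not from the linked post): count rows using only
# dplyr verbs, so the call does not depend on nrow().
count_rows <- function(tbl) {
  tbl %>%
    summarise(n = n()) %>%  # one-row result holding the row count
    collect() %>%           # pull the result into R (no-op for local data)
    pull(n) %>%
    as.numeric()            # normalize integer / double return types
}

local_tbl <- tibble(x = c(1, 2, 3))

# Two of the uses from the list above: reporting size and checking for empty.
row_total <- count_rows(local_tbl)
is_empty  <- count_rows(filter(local_tbl, x > 10)) == 0
```

Wrapping the count in a helper keeps calling code the same whether the table is local or remote, which is roughly what one would have expected nrow() to do in the first place.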



September 2017