Counting Rows In Spark With Dplyr

Kevin Feasel



John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Spark, database, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.
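
For a rough idea of what counting looks like without nrow(), here is a minimal sketch. It is not taken from John's post: count_rows() and is_empty_tbl() are hypothetical helper names, and the approach assumes that summarise(n = n()) gets translated to a COUNT(*) query when the table lives in Spark or a database. The demo runs on a local data frame so that it works without a cluster; the Spark lines are left commented out and assume a working sparklyr connection.

```r
library(dplyr)

# Hypothetical helper (not from the original post): count rows using verbs
# that dplyr can translate to SQL, instead of relying on nrow().
count_rows <- function(tbl) {
  tbl %>%
    ungroup() %>%            # drop any grouping so we get a single total
    summarise(n = n()) %>%   # assumed to become COUNT(*) on a remote table
    pull(n) %>%              # bring the single value back into R
    as.numeric()
}

# Hypothetical helper: emptiness check built on the same count.
is_empty_tbl <- function(tbl) count_rows(tbl) == 0

# Local data frame, so the sketch runs without a Spark cluster.
local_rows <- count_rows(mtcars)
local_empty <- is_empty_tbl(mtcars)

# With a Spark connection (assumed, not run here):
# sc <- sparklyr::spark_connect(master = "local")
# cars_tbl <- dplyr::copy_to(sc, mtcars, "cars")
# count_rows(cars_tbl)
# sparklyr::sdf_nrow(cars_tbl)   # sparklyr's own row-count function
```

The same two helpers would cover the first and third uses in the list above (emptiness checks and size confirmation); whether the extra round trip to the cluster is acceptable depends on the table and the workload.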

