Counting Rows In Spark With Dplyr

Kevin Feasel



John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Sparkdatabase, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.

Related Posts

xgboost and Small Numbers of Subtrees

John Mount covers an interesting issue you can run into when using xgboost: While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).In doing that I ran into one more […]

Read More

Reinforcement Learning with R

Holger von Jouanne-Diedrich takes us through concepts in reinforcement learning: At the core this can be stated as the problem a gambler has who wants to play a one-armed bandit: if there are several machines with different winning probabilities (a so-called multi-armed bandit problem) the question the gambler faces is: which machine to play? He could “exploit” one […]

Read More


September 2017
« Aug Oct »