Counting Rows In Spark With Dplyr

Kevin Feasel



John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Spark, database, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.
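To see the behavior Mount describes, here is a minimal sketch using sparklyr with a local Spark connection; the `iris` copy is purely illustrative. `nrow()` on a remote table returns `NA`, and the usual workarounds are `sparklyr::sdf_nrow()` or counting through dplyr verbs:

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation configured for sparklyr.
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris_spark")  # illustrative data

# nrow() on a remote tbl returns NA rather than the row count:
nrow(iris_tbl)
#> [1] NA

# Workarounds: sparklyr's sdf_nrow(), or an explicit aggregate:
sdf_nrow(iris_tbl)
iris_tbl %>% summarise(n = n()) %>% pull(n)

spark_disconnect(sc)
```

The `NA` is deliberate: a remote tbl is a lazy query, so computing the row count would force a potentially expensive pass over the data, which is exactly the design choice Mount pushes back on.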



September 2017