Counting Rows In Spark With Dplyr

Kevin Feasel

2017-09-04

R

John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Sparkdatabase, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.

Related Posts

Microsoft R Open 3.5.1

David Smith announces Microsoft R Open 3.5.1: Microsoft R Open 3.5.1 has been released, combining the latest R language engine with multi-processor performance and tools for managing R packages reproducibly. You can download Microsoft R Open 3.5.1 for Windows, Mac and Linux from MRAN now. Microsoft R Open is 100% compatible with all R scripts and packages, and works with […]

Read More

Performing Linear Regression With Power BI

Jason Cantrell shows how to create a simple linear regression in Power BI: Linear Regression is a very useful statistical tool that helps us understand the relationship between variables and the effects they have on each other. It can be used across many industries in a variety of ways – from spurring value to gaining […]

Read More

Categories

September 2017
MTWTFSS
« Aug Oct »
 123
45678910
11121314151617
18192021222324
252627282930