Getting Distinct Rows In R

Kevin Feasel



Rob J. Hyndman shows four different techniques (one “classic” and three tidyverse) for getting a distinct subset of a data set in R:

So that looks much better — clean, short, and easy to understand. But is it fast? Rather than grabbing the first lines of each group, it has to go searching for duplicates. But avoiding grouping and ungrouping must save some time.

So I ran some microbenchmark timings:

Click through for the techniques and timings. I’m not surprised that the “classic” method won on speed, but for explanatory value, I’d much rather explain the tidyverse distinct() version. H/T R-Bloggers
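For a rough sense of what is being compared, here is a minimal sketch. It is not Hyndman's code: the data frame df and its columns id and value are made up for illustration, and I am assuming the "classic" approach is along the lines of base R's duplicated(). The dplyr part only runs if dplyr is installed.

```r
# Toy data (hypothetical): several rows per id; we want one row per id.
df <- data.frame(
  id    = c(1, 1, 2, 3, 3, 3),
  value = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Base R: duplicated() flags every repeat of an id after its first occurrence,
# so negating it keeps the first row seen for each id.
base_result <- df[!duplicated(df$id), ]

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Group, take the first row of each group, then ungroup.
  grouped_result <- df %>%
    group_by(id) %>%
    slice(1) %>%
    ungroup()

  # distinct() on id; .keep_all = TRUE retains the remaining columns.
  distinct_result <- df %>%
    distinct(id, .keep_all = TRUE)
}
```

All three approaches keep the first row encountered for each id; the difference the quoted passage is getting at is readability versus the cost of grouping or of searching for duplicates.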

September 2017