
Finding Duplicates in a Spark DataFrame

The Hadoop in Real World team shows how to deduplicate rows in a DataFrame in Spark:

It is a pretty common use case to find the list of duplicate elements or rows in a Spark DataFrame and it is very easy to do with a groupBy() and a count().

Where “easy” depends on just how many columns you’re dealing with in the DataFrame.