The Hadoop in Real World team shows how to deduplicate rows in a DataFrame in Spark:
It is a pretty common use case to find the list of duplicate elements or rows in a Spark DataFrame and it is very easy to do with a groupBy() and a count().
Of course, just how “easy” it is depends on how many columns you’re dealing with in the DataFrame.
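For reference, here is a minimal PySpark sketch of the groupBy() and count() approach. This is my own illustration rather than the article’s code, and the sample DataFrame and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("find-duplicates").getOrCreate()

# Hypothetical sample data; the third row duplicates the first.
df = spark.createDataFrame(
    [("alice", 1), ("bob", 2), ("alice", 1)],
    ["name", "score"],
)

# Group on every column; any group with count > 1 is a duplicate row.
duplicates = (
    df.groupBy(df.columns)   # groupBy accepts a list of column names
      .count()
      .filter(F.col("count") > 1)
)

duplicates.show()
```

And if the goal is to remove the duplicates rather than list them, dropDuplicates() handles that in a single call.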