
Finding Duplicates in a Spark DataFrame

The Hadoop in Real World team shows how to deduplicate rows in a DataFrame in Spark:

It is a pretty common use case to find the list of duplicate elements or rows in a Spark DataFrame and it is very easy to do with a groupBy() and a count().

Where “easy” depends on just how many columns you’re dealing with in the DataFrame.