Giovanni Lanzani explains that one technique to split a data frame doesn’t quite work as expected:
Recently I was delivering a Spark course. One of the exercises asked the students to split a Spark DataFrame into two non-overlapping parts.
One of the students came up with a creative way to do so.
He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id; you can find how to use it in the docs.
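For a feel of what that ID column looks like, here is a minimal plain-Python sketch (no Spark required) of the layout the Spark documentation describes for monotonically_increasing_id: the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper name sketch_ids and the partition sizes are hypothetical, chosen only for illustration; this is not Spark's own code, and it is not meant to give away the explanation in the linked post.

```python
# Plain-Python sketch of the ID layout documented for Spark's
# monotonically_increasing_id: partition index in the upper 31 bits,
# record number within the partition in the lower 33 bits.
# sketch_ids is a hypothetical helper, not part of any Spark API.
RECORD_BITS = 33


def sketch_ids(partition_sizes):
    """Return, per hypothetical partition, the IDs its rows would receive."""
    return [
        [(partition_index << RECORD_BITS) + record for record in range(size)]
        for partition_index, size in enumerate(partition_sizes)
    ]


# Two made-up partitions holding 3 and 2 rows respectively.
ids = sketch_ids([3, 2])
flat = [row_id for partition in ids for row_id in partition]
print(flat)  # [0, 1, 2, 8589934592, 8589934593]
```

The IDs come out unique and increasing, but not consecutive: the second partition starts at 1 << 33 rather than at 3, so the values depend on how the rows happen to be partitioned.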
Read on to see how this didn’t quite work right, why it didn’t work as expected, and one alternative.