Giovanni Lanzani explains that one technique to split a data frame doesn’t quite work as expected:
Recently I was delivering a Spark course. One of the exercises asked the students to split a Spark DataFrame into two non-overlapping parts.
One of the students came up with a creative way to do so.
He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id; you can find how to use it in the docs.
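As a rough illustration of what these IDs look like, here is a pure-Python model of the layout the Spark documentation describes for the current implementation: the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. This is not Spark code, and model_ids is a hypothetical helper made up for this sketch. In PySpark the real call would be along the lines of df.withColumn("id", monotonically_increasing_id()), with the function imported from pyspark.sql.functions.

```python
# Pure-Python model of the ID layout documented for Spark's
# monotonically_increasing_id. Illustration only: this does not call Spark,
# and `model_ids` is a hypothetical helper, not part of any library.

RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def model_ids(partition_sizes):
    """Return one list of IDs per partition, given each partition's row count."""
    return [
        [(partition_index << RECORD_BITS) + record_number
         for record_number in range(size)]
        for partition_index, size in enumerate(partition_sizes)
    ]


# Assumed example: a DataFrame with two partitions holding 3 and 2 rows.
ids = model_ids([3, 2])
flat = [row_id for partition in ids for row_id in partition]
print(flat)  # [0, 1, 2, 8589934592, 8589934593]
```

The IDs are unique and increasing, but they are not consecutive: the second partition starts at 1 << 33, so the values depend on how the rows are partitioned.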
Read on to see what went wrong, why it happened, and one alternative.