Janani Annur Thiruvengadam stands some common advice on its head:
If you’ve worked with Apache Spark, you’ve probably heard the conventional wisdom: “Use coalesce() instead of repartition() when reducing partitions — it’s faster because it avoids a shuffle.” This advice appears in documentation and blog posts, and is repeated across Stack Overflow threads. But what if I told you this isn’t always true?

In a recent production workload, I discovered that using repartition() instead of coalesce() resulted in a 33% performance improvement (16 minutes vs. 23 minutes) when writing data to fewer partitions. This counterintuitive result reveals an important lesson about Spark’s Catalyst optimizer that every Spark developer should understand.
Read on for the details on that scenario.