Press "Enter" to skip to content

When repartition() Beats coalesce() in Spark

Janani Annur Thiruvengadam stands some common advice on its head:

If you’ve worked with Apache Spark, you’ve probably heard the conventional wisdom: “Use coalesce() instead of repartition() when reducing partitions — it’s faster because it avoids a shuffle.” This advice appears in documentation, blog posts, and is repeated across Stack Overflow threads. But what if I told you this isn’t always true?

In a recent production workload, I discovered that using repartition() instead of coalesce() resulted in a 33% performance improvement (16 minutes vs. 23 minutes) when writing data to fewer partitions. This counterintuitive result reveals an important lesson about Spark’s Catalyst optimizer that every Spark developer should understand.

Read on for the details on that scenario.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.