Press "Enter" to skip to content

Join Algorithm Selection in Spark

The Hadoop in Real World team takes us through the selection criteria for join types:

There are several factors Spark takes into account before deciding on the type of join algorithm to use to join datasets at runtime.

Spark has the following 5 algorithms to choose from –

1. Broadcast Hash Join
2. Shuffle Hash Join
3. Shuffle Sort Merge Join
4. Broadcast Nested Loop Join
5. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)

Read on to learn which join types are supported in which circumstances, as well as rules of precedence.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.