The Hadoop in Real World team takes us through the selection criteria for join types:
There are several factors Spark takes into account before deciding which join algorithm to use when joining datasets at runtime.
Spark has the following five algorithms to choose from:
1. Broadcast Hash Join
2. Shuffle Hash Join
3. Shuffle Sort Merge Join
4. Broadcast Nested Loop Join
5. Cartesian Product Join (a.k.a. Shuffle-and-Replicate Nested Loop Join)
Read on to learn which join types are supported in which circumstances, as well as rules of precedence.
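To make the selection concrete, here is a minimal sketch of how you can observe or influence Spark's choice from the DataFrame API. It uses the broadcast() function and the join hints added in Spark 3.0 ("merge", "shuffle_hash"); the dataset and column names are made up for illustration, and explain() simply prints which physical join operator Spark actually picked.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Minimal sketch; table/column names (orders, customers, cust_id) are illustrative.
object JoinHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-hint-sketch")
      .master("local[*]")
      // Broadcast hash join is only considered when one side is estimated
      // below this threshold (default 10 MB); setting it to -1 disables it.
      .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
      .getOrCreate()

    import spark.implicits._

    val orders    = Seq((1, "A"), (2, "B"), (3, "A")).toDF("order_id", "cust_id")
    val customers = Seq(("A", "Alice"), ("B", "Bob")).toDF("cust_id", "name")

    // Let Spark choose the strategy on its own (likely a broadcast hash join
    // here, since customers is tiny).
    val auto = orders.join(customers, "cust_id")

    // Or nudge the optimizer with hints (Spark 3.0+ join hints):
    val bcast = orders.join(broadcast(customers), "cust_id")           // broadcast hash join
    val smj   = orders.join(customers.hint("merge"), "cust_id")        // shuffle sort merge join
    val shj   = orders.join(customers.hint("shuffle_hash"), "cust_id") // shuffle hash join

    // explain() shows which physical join operator was actually selected.
    auto.explain()
    bcast.explain()
    smj.explain()
    shj.explain()

    spark.stop()
  }
}
```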