Ram Ghadiyaram looks at three join strategies in Apache Spark:
In this article, we are going to discuss three essential join strategies in Apache Spark.
The DataFrame (or table) join is one of the most commonly used transformations in Apache Spark. With a join, a developer can merge two or more DataFrames on specific (sortable) keys. The syntax of a join is straightforward, but its inner workings are sometimes obscured: under the hood, Spark's internal API considers several join algorithms and selects one. A seemingly basic join can become costly if you do not know what these core algorithms are or which one Spark chooses.
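To see which algorithm Spark picks, you can inspect the physical plan of a join with `explain()`. The sketch below is a minimal, self-contained example; the DataFrame names and sample data are illustrative, not from the article:

```scala
import org.apache.spark.sql.SparkSession

object JoinStrategyDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .appName("JoinStrategyDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two small hypothetical DataFrames keyed by "id"
    val employees = Seq((1, "Alice"), (2, "Bob"), (3, "Carol")).toDF("id", "name")
    val salaries  = Seq((1, 90000), (2, 80000)).toDF("id", "salary")

    // An inner join on the shared key column
    val joined = employees.join(salaries, Seq("id"), "inner")

    // explain() prints the physical plan; the plan names the join
    // strategy Spark selected (e.g. BroadcastHashJoin or SortMergeJoin)
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```

Because both inputs here are tiny, Spark will typically broadcast one side; with larger inputs the plan can change, which is exactly why knowing the underlying strategies matters.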
This is not a comprehensive list, but it does cover three of the more common strategies when dealing with larger datasets.