Tomaz Kastrun continues a series on Apache Spark. Part 13 looks at bucketing and partitioning in Spark SQL:
Partitioning and Bucketing in Hive are used to improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between them is how they split the data.
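To make the difference concrete, here is a minimal PySpark sketch (not taken from the linked post; the path, table name, and columns are made up for illustration). Partitioning splits the output into one directory per distinct column value, while bucketing hashes a column into a fixed number of files and has to be written out as a table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()

# Hypothetical sample data: a million rows with a low-cardinality "region" column
df = (spark.range(0, 1_000_000)
      .withColumn("region", (col("id") % 4).cast("string")))

# Partitioning: one directory per region value, good for low-cardinality filter columns
df.write.mode("overwrite").partitionBy("region").parquet("/tmp/demo_partitioned")

# Bucketing: hash "id" into a fixed number of buckets; requires saveAsTable()
(df.write.mode("overwrite")
   .bucketBy(16, "id")
   .sortBy("id")
   .saveAsTable("demo_bucketed"))
```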
This hint instructs Spark to use the hinted join strategy on the specified relation when joining tables together. When the BROADCASTJOIN hint is used on the Data1 table with the Data2 table, it overrides the suggestion derived from statistics and the spark.sql.autoBroadcastJoinThreshold configuration setting. Spark SQL also prioritizes join strategies: when different join strategy hints are specified, it will always apply them in its order of precedence.
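As a rough illustration of the hint in action (again, not from the linked post; the Data1/Data2 names follow the excerpt and everything else is assumed for the example), both the DataFrame broadcast() function and the SQL /*+ BROADCASTJOIN(...) */ hint force a broadcast hash join regardless of the autoBroadcastJoinThreshold setting:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-hint-demo").getOrCreate()

# Hypothetical tables: Data1 is large, Data2 is small enough to broadcast
data1 = spark.range(0, 1_000_000).withColumnRenamed("id", "key")
data2 = spark.range(0, 100).withColumnRenamed("id", "key")

# DataFrame API: explicitly broadcast the small side of the join
joined = data1.join(broadcast(data2), "key")

# SQL hint form (BROADCASTJOIN is an alias of BROADCAST / MAPJOIN)
data1.createOrReplaceTempView("Data1")
data2.createOrReplaceTempView("Data2")
joined_sql = spark.sql("""
    SELECT /*+ BROADCASTJOIN(d2) */ d1.key
    FROM Data1 d1
    JOIN Data2 d2 ON d1.key = d2.key
""")

# The physical plan should show BroadcastHashJoin rather than SortMergeJoin
joined.explain()
```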
Be sure to check those out.