Broadcast Nested Loop Joins In Spark

Kevin Feasel

2017-03-01

Spark

Reynold Xin, et al, debug an interesting test case:

While we were pretty happy with the improvement, we noticed that one of the test cases in Databricks started failing. To simulate a hanging query, the test case performed a cross join to produce 1 trillion rows.

spark.range(1000 * 1000).crossJoin(spark.range(1000 * 1000)).count()

On a single node, we expected this query would run infinitely or “hang.” To our surprise, we started seeing this test case failing nondeterministically because sometimes it completed on our Jenkins infrastructure in less than one second, the time limit we put on this query.

You’re not going to get this performance against a real data set, but it was interesting reading their troubleshooting notes.

Related Posts

Handling Missing Data In Spark

Igor Sorokin explains how to implement DataFrameNaFunctions: Unfortunately, C&P comes in to play, therefore, if at some point in time a default value for ‘trackLength’ is also required, you may end up changing both of these methods. Another disadvantage is that if another similar method, which requires the same default values, is added, code duplication […]

Read More

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works: At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore […]

Read More

Categories

March 2017
MTWTFSS
« Feb Apr »
 12345
6789101112
13141516171819
20212223242526
2728293031