Subqueries In Spark 2.0

Kevin Feasel



Davies Liu and Herman van Hövell discuss SQL subqueries in Apache Spark 2.0:

In the upcoming Apache Spark 2.0 release, we have substantially expanded the SQL standard capabilities. In this brief blog post, we will introduce subqueries in Apache Spark 2.0, including their limitations, potential pitfalls and future expansions, and through a notebook, we will explore both the scalar and predicate type of subqueries, with short examples that you can try yourself.

A subquery is a query that is nested inside of another query. A subquery as a source (inside aSQL FROM clause) is technically also a subquery, but it is beyond the scope of this post. There are basically two kinds of subqueries: scalar and predicate subqueries. And within scalar and predicate queries, there are uncorrelated scalar and correlated scalar queries and nested predicate queries respectively.

They also link to a Notebook which you can use to follow along.  If you’re interested in window functions, here are notes from Spark 1.4.

Related Posts

Exception Handling In Scala

Shivangi Gupta shows off the Either keyword in Scala: How to get values from Either? There are many ways we will talk about all one by one.  One way to get values is by doing left and right projection. We can not perform any operation i.e, map, filter etc; on Either. Either provide left and right methods to get the left and right projection. Projection on […]

Read More

Flattening JSON Data With Databricks

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark: With Databricks you get: An easy way to infer the JSON schema and avoid creating it manually Subtle changes in the JSON schema won’t break things The ability to explode nested lists into rows in a very easy way (see the […]

Read More


June 2016
« May Jul »