Databricks Delta: Data Skipping And ZORDER Clustering

Adrian Ionescu explains a couple of concepts which can help make selective queries with Databricks much faster:

The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context due to scalability reasons.

If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.

Quick reminder: In Spark, just like Hive, partitioning works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.

The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?

Read the whole thing.

Related Posts

From pandas to Spark with koalas

Achilleus tries out Koalas: Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium […]

Read More

Overriding Spark Dependencies

Landon Robinson shows how to override a Spark dependency located on the classpath: This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:– gson-2.2.4.jar: the method is private, and therefore too old for use here– gson-2.6.1: the method is public, and works fine.– Somewhere between the two, the […]

Read More


August 2018
« Jul Sep »