Databricks Delta: Data Skipping And ZORDER Clustering

Adrian Ionescu explains a couple of concepts which can help make selective queries with Databricks much faster:

The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context due to scalability reasons.

If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.

Quick reminder: In Spark, just like Hive, partitioning works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.

The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?

Read the whole thing.

Related Posts

Working With The Databricks API Via Powershell

Gerhard Brueckl has a Powershell module for interacting with Databricks, either Azure or AWS: As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended […]

Read More

Kafka Connect Converters And Serialization

Robin Moffatt goes into great detail on Apache Kafka Connect converters and serialization techniques: Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include: Connectors – the JAR files that define how to integrate with the data store itself Converters – handling serialization and deserialization of […]

Read More


August 2018
« Jul Sep »