Z-Ordering with Apache Impala

Zoltan Borok-Nagy and Norbert Luksa show off a performance improvement in Apache Impala:

So we’ll have great search capabilities against the partition columns plus one data column (which drives the ordering in the data files). With our sample schema above, this means we could specify a SORT BY “platform” to enable fast analysis of all Android or iOS users. But what if we wanted to understand how well version 5.16 of our app is doing across platforms and countries?

Can we do more? It turns out that we can. There are exotic orderings out there that can also sort data by multiple columns. In this post, we will describe how Z-order allows ordering of multidimensional data (multiple columns) with the help of a space-filling curve. This ordering enables us to efficiently search against more columns. More on that later.
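To make the space-filling curve idea concrete, here is a minimal sketch of a Z-order (Morton) key for two non-negative integer columns. It illustrates the general bit-interleaving technique, not Impala's actual implementation; the function name `interleave_bits` and the 4x4 grid of sample points are hypothetical and chosen only for illustration.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Build a Z-order (Morton) key by interleaving the bits of x and y.

    Assumes x and y are non-negative integers that fit in `bits` bits.
    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical sample data: every (x, y) pair on a 4x4 grid.
points = [(x, y) for y in range(4) for x in range(4)]

# Sorting by the interleaved key keeps points that are close in BOTH
# dimensions near each other, instead of favoring the first column.
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

print(z_sorted[:4])  # the first 2x2 block of the grid
```

Because neither column dominates the sort key, each contiguous run of rows covers a narrow range of values in both columns, which is what lets per-file or per-page min/max statistics skip data when filtering on either one.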

It looks like a really good technique for nearly-static data, such as a data warehouse that refreshes once a day.