Each Apache Parquet file contains a footer where metadata can be stored including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the
max_valuefields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated
The effectiveness of the Parquet
max_valuecolumn statistics for data skipping can be increased by ordering (or clustering1) data when it is written by reducing the range of values that fall between the minimum and maximum value for any given file. It was for this reason that Impala 2.9 added the
SORT BYclause to table DDL which directs Impala to sort data locally during an
INSERTbefore writing the data to files.
Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).