Mostafa Mokhtar, et al., explain a few methods for skipping unneeded data in Impala queries:
Each Apache Parquet file contains a footer where metadata can be stored including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the min_value and max_value fields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated min and max fields.
The effectiveness of the Parquet min_value / max_value column statistics for data skipping can be increased by ordering (or clustering) data when it is written by reducing the range of values that fall between the minimum and maximum value for any given file. It was for this reason that Impala 2.9 added the SORT BY clause to table DDL which directs Impala to sort data locally during an INSERT before writing the data to files.
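To make the effect of sorting concrete, here is a minimal Python sketch of min/max data skipping. It is a toy model, not Impala's implementation: the FileStats class, the write_files and files_to_scan helpers, and the 10-rows-per-file layout are all hypothetical names and assumptions chosen for illustration. Each "file" only records the minimum and maximum of one column, and a file can be skipped when the value in an equality predicate falls outside that range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Hypothetical stand-in for one file's footer statistics on a single column."""
    min_value: int
    max_value: int


def write_files(values: List[int], rows_per_file: int) -> List[FileStats]:
    """Split values into fixed-size 'files' and record each file's min and max."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append(FileStats(min(chunk), max(chunk)))
    return files


def files_to_scan(files: List[FileStats], target: int) -> int:
    """Count files whose [min, max] range could contain target (col = target)."""
    return sum(1 for f in files if f.min_value <= target <= f.max_value)


# Deterministic scattered data: a permutation of 0..99 (37 and 100 are coprime).
unsorted_values = [(i * 37) % 100 for i in range(100)]
# The same rows, sorted on the column before being written out.
sorted_values = sorted(unsorted_values)

unsorted_files = write_files(unsorted_values, rows_per_file=10)
sorted_files = write_files(sorted_values, rows_per_file=10)

unsorted_scans = files_to_scan(unsorted_files, target=42)
sorted_scans = files_to_scan(sorted_files, target=42)

print(f"unsorted layout: scan {unsorted_scans} of {len(unsorted_files)} files")
print(f"sorted layout:   scan {sorted_scans} of {len(sorted_files)} files")
```

With this particular data, every unsorted file has a wide min/max range that includes 42, so none of the 10 files can be skipped. After sorting, the ranges no longer overlap and only one file needs to be read.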
Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).