Mostafa Mokhtar, et al., explain a few methods for skipping unneeded data in Impala queries:
Each Apache Parquet file contains a footer where metadata can be stored, including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the min_value and max_value fields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated min and max fields.

The effectiveness of the Parquet min_value/max_value column statistics for data skipping can be increased by ordering (or clustering) data as it is written, which reduces the range of values between the minimum and maximum in any given file. It was for this reason that Impala 2.9 added the SORT BY clause to table DDL, which directs Impala to sort data locally during an INSERT before writing the data to files.
Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).