Data Skip Techniques In Impala

Mostafa Mokhtar, et al, explain a few methods for skipping unneeded data in Impala queries:

Each Apache Parquet file contains a footer where metadata can be stored including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the min_value and max_value fields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated min and max fields.

The effectiveness of the Parquet min_value/max_value column statistics for data skipping can be increased by ordering (or clustering1) data when it is written by reducing the range of values that fall between the minimum and maximum value for any given file. It was for this reason that Impala 2.9 added the SORT BY clause to table DDL which directs Impala to sort data locally during an INSERT before writing the data to files.

Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).

Related Posts

Kafka Connect Converters And Serialization

Robin Moffatt goes into great detail on Apache Kafka Connect converters and serialization techniques: Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include: Connectors – the JAR files that define how to integrate with the data store itself Converters – handling serialization and deserialization of […]

Read More

Tuning Apache Spark Applications

Vidisha Gupta has a few tips for tuning Apache Spark programs: Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries – Java Serialization: By default, spark uses Java’s ObjectOutputStream framework which can work with any class that implements java.io.serializable. This serialization is flexible but slow and […]

Read More

Categories

December 2017
MTWTFSS
« Nov Jan »
 123
45678910
11121314151617
18192021222324
25262728293031