Vectorization With Apache Hive And Parquet Tables

Vihang Karajgaonkar, et al, take us through using a performance improvement in Apache Hive using Parquet tables:

The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling parquet vectorization gave 26.5% performance improvement on average (geomean value of runtime for all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions like supported column data-types and expressions are satisfied. However, if the query cannot be vectorized its execution falls back to a non-vectorized execution. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.

This is worth looking into, especially if you are on the Cloudera stack.

Related Posts

MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class: Let me tell you about one of my least favorite things I like to see in PolyBase: Error: Could not find or load main class This error is not limited to PolyBase but is instead […]

Read More

Using the StreamSets Snowflake Destination

Dash Desai shows how you can use StreamSets to write data into SnowflakeDB: In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically. To illustrate, let’s take HTTP web server logs […]

Read More


December 2018
« Nov Jan »