To understand how data is consumed, we need to answer some basic questions:
- Which datasets (tables/views/DBs) are accessed frequently?
- When are the queries run most frequently?
- Which users or applications are heavily utilizing the resources?
- What types of queries run most frequently?
The most frequently accessed objects can benefit from optimizations such as compression, a columnar file format, or data decomposition. Applications or users that consume heavy resources can be assigned a separate queue to balance the load on the cluster. Cluster resources can be scaled up during the periods when most queries run, to meet SLAs, and scaled down during periods of low usage to save cost.
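To make the four questions concrete, here is a minimal sketch of how they reduce to simple counts once query events have been captured somewhere. The `QueryEvent` record, its fields, the operation labels, and the sample data are all hypothetical and chosen only for illustration; they are not a Hive data structure or a Hive API.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryEvent:
    """One executed query. Hypothetical record shape, not a Hive structure."""
    user: str             # user or application that submitted the query
    operation: str        # illustrative label such as "SELECT" or "INSERT"
    tables: tuple         # names of the tables/views the query read
    started_at: datetime  # when the query started


def summarize(events):
    """Count events per table, hour of day, user, and operation type."""
    tables, hours, users, operations = Counter(), Counter(), Counter(), Counter()
    for event in events:
        tables.update(event.tables)          # which datasets are accessed?
        hours[event.started_at.hour] += 1    # when are queries run?
        users[event.user] += 1               # who uses the cluster?
        operations[event.operation] += 1     # what kind of queries run?
    return {"tables": tables, "hours": hours, "users": users, "operations": operations}


# Made-up sample events, for illustration only.
EVENTS = [
    QueryEvent("etl_app", "SELECT", ("sales.orders", "sales.customers"), datetime(2024, 1, 1, 2, 15)),
    QueryEvent("etl_app", "SELECT", ("sales.orders",), datetime(2024, 1, 1, 2, 40)),
    QueryEvent("analyst", "SELECT", ("sales.orders",), datetime(2024, 1, 1, 9, 5)),
    QueryEvent("etl_app", "INSERT", ("sales.customers",), datetime(2024, 1, 1, 14, 30)),
]

summary = summarize(EVENTS)
for question, counter in summary.items():
    # Print the single most frequent value for each question.
    print(question, counter.most_common(1))
```

The aggregation itself is trivial; the hard part is capturing a reliable record of every query along with its user, inputs, and timing.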
Hive Hooks are a convenient way to answer some of the above questions and more!
Read on to learn how.