David Dichmann walks us through the differences between Impala and Hive LLAP:
Written in C++, which is very CPU efficient, with a very fast query planner and metadata caching, Impala is optimized for low latency queries. Because of this, Impala is an ideal engine for use with a data mart, since people working with data marts are mostly running read-only queries and not large scale writes.
Impala also has a very efficient run-time execution framework, using code generation, process-to-process communication, massive parallelism, and metadata caching. Because of this, Impala is also great when working with ad-hoc queries, like when exploring by iteratively digging into data. You’ll want to change your query over and over again, at a moment’s notice, and have very fast response times so you’re not waiting forever for each iteration.
I was curious what would end up happening with Hive and Impala once old Cloudera (Impala) and Hortonworks (Hive) merged together. Looks like the answer, at least for now, is that they’re both useful in different circumstances. But I do wonder how long that lasts—it’s not impossible to sell using two separate data platform products for different steps in a warehouse implementation, but I could see architects and CIOs wanting to make things simpler and narrow down to one unless there was a particularly smooth bridge between the two.
Comments closed