Sub-Second Hive Analytics

Kevin Feasel

2017-05-15

Hadoop

Carter Shanklin and Slim Bouguerra have started a series on using Hive and Druid to obtain sub-second SQL queries over terabytes of data:

We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool to get accelerated business results.  And we will show benchmark results of BI queries running in just milliseconds over a 1TB dataset.

 WHAT IS DRUID?

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6 and you can read more about Druid on our project page, or at the project website.

This first post is mostly about Druid, which sounds like it might eventually become a very interesting technology for implementing Kimball-style warehouse models but for the whole “Joins?  We don’t need no steenkin’ joins” philosophy.  But when used as one engine component (as mentioned in the post), I can see it being quite useful.

Related Posts

Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka: When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message A before crashing […]

Read More

Avro Schemas In Kafka

Stephane Maarek explains the value of using Apache Avro as a schema structure for your Kafka topics: Avro has support for primitive types ( int, string, long, bytes, etc…), complex types (enum, arrays, unions, optionals), logical types (dates, timestamp-millis, decimal), and data record (name and namespace). All the types you’ll ever need. Avro has support for embedded documentation. Although documentation is optional, in my workflow I […]

Read More

Categories