Sub-Second Hive Analytics

Kevin Feasel

2017-05-15

Hadoop

Carter Shanklin and Slim Bouguerra have started a series on using Hive and Druid to obtain sub-second SQL queries over terabytes of data:

We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool to get accelerated business results.  And we will show benchmark results of BI queries running in just milliseconds over a 1TB dataset.

WHAT IS DRUID?

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6 and you can read more about Druid on our project page, or at the project website.

This first post is mostly about Druid, which sounds like it could eventually become a very interesting technology for implementing Kimball-style warehouse models, were it not for the whole “Joins? We don’t need no steenkin’ joins” philosophy. But used as one engine component (as mentioned in the post), I can see it being quite useful.
