Press "Enter" to skip to content

Working with Hive in HDInsight

Brad Llewellyn takes us through building an HDInsight cluster and writing Hive queries against it:

Hive is a “SQL on Hadoop” technology that combines the scalable processing framework of the ecosystem with the coding simplicity of SQL.  Hive is very useful for performant batch processing on relational data, as it leverages all of the skills that most organizations already possess.  Hive LLAP (Low Latency Analytical Processing or Live Long and Process) is an extension of Hive that is designed to handle low latency queries over massive amounts of EXTERNAL data.  One of this coolest things about the Hadoop SQL ecosystem is that the technologies allow us to create SQL tables directly on top of structured and semi-structured data without having to import it into a proprietary format.  That’s exactly what we’re going to do in this post.  You can read more about Hive here and here and Hive LLAP here.

We understand that SQL queries don’t typically constitute traditional data science functionality.  However, the Hadoop ecosystem has a number of unique and interesting data science features that we can explore.  Hive happens to be one of the best starting points on that journey.

Click through for the screenshot-laden demonstration.