When Spark Meets Hive

Anna Martin and Rosaria Silipo look at combining HiveQL and SparkQL:

We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and HiveQL to extract women’s age distribution. We could then compare the two distributions and see if they show any difference.

But the main question, as usual, is: Will SparkSQL queries and HiveQL queries blend?

Topic: Age distribution for men and women in the U.S. state of Maine.

Challenge: Blend results from Hive SQL and Spark SQL queries.

Access mode: Apache Spark and Apache Hive nodes for SQL processing.

Using KNIME, the authors are able to blend together data from different sources.

Related Posts

Last-Click Attribution With Databricks Delta

Caryl Yuhas and Denny Lee give us an example of building a last-click digital marketing attribution model with Databricks Delta: The first thing we will need to do is to establish the impression and conversion data streams.   The impression data stream provides us a real-time view of the attributes associated with those customers who were served the […]

Read More

Working With Kafka At Scale

Tony Mancill has some tips for working with large-scale Kafka clusters: Unless you have architectural needs that require you to do otherwise, use random partitioning when writing to topics. When you’re operating at scale, uneven data rates among partitions can be difficult to manage. There are three main reasons for this: First, consumers of the “hot” […]

Read More

Categories

December 2017
MTWTFSS
« Nov Jan »
 123
45678910
11121314151617
18192021222324
25262728293031