Using Hive For Data Virtualization

Kevin Feasel

2018-12-28

Hadoop

Gunther Hagleitner, et al, walk us through some reasons why we might want to use Apache Hive for data virtualization:

Assume you want to execute a Hive query that accesses data from an external RDBMS behind a JDBC connection. A possible naïve way of doing this would treat the JDBC source as a “dumb” storage system, reading all the raw data over JDBC and processing it in Hive. In this case you would ignore the query capabilities of the RDBMS and pull too much data over the JDBC link, thus ending up with poor performance and an overloaded system.
For that reason, Hive implements smart push-down to other systems by relying on its storage handler interfaces and cost-based optimizer (CBO) powered by Apache Calcite. In particular, Calcite provides rules that match a subset of operators in the logical representation of the query and generates a new equivalent representation with more operations executed in the external system. Hive includes those rules that push computation to the external systems in its query planner, and then relies on Calcite to generate a valid query in the language that those systems support. The storage handler implementations are responsible to send the generated query to the external system, retrieve its results, and transform the incoming data into Hive internal representation so it can be processed further if needed.

A lot of platforms are moving toward data virtualization (e.g., SQL Server with its Big Data Clusters). That appears to be the next product battleground.

Related Posts

MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class: Let me tell you about one of my least favorite things I like to see in PolyBase: Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster This error is not limited to PolyBase but is instead […]

Read More

Database-First or Kafka-First for Event Streaming

Gwen Shapiro takes us through a scenario where database-first writes for event streaming makes the most sense: Note that the DB does quite a lot for you: it enforces serializability, locks, your logical constraints, etc. If the DB is distributed (Vitesse, Cockroach, Spanner, Yugabyte), it does even more. If you were to go Kafka-first… well, […]

Read More

Categories

December 2018
MTWTFSS
« Nov Jan »
 12
3456789
10111213141516
17181920212223
24252627282930
31