MERGE In Hive

Carter Shanklin notes that Hive now has the ability to run MERGE statements:

As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex.  Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates are in progress.

This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.

This post mostly describes the gains rather than showing code, but it does show that Hive developers are looking at expanding beyond common Hadoop warehousing scenarios.

Related Posts

Pipeline Architecture With Kafka

Alexandra Wang describes how Pandora Media has used Apache Kafka for real-time ad serving using Kafka Connect: Our ad server publishes billions of messages per day to Kafka. We soon realized that writing a proprietary Kafka consumer able to handle that amount of data with the desired offset management logic would be non-trivial, especially when requiring […]

Read More

Using h2o.ai On HDInsight

Xiaoyong Zhu shows how to set up h2o.ai on Azure HDInsight: H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

Categories

April 2017
MTWTFSS
« Mar  
 12
3456789
10111213141516
17181920212223
24252627282930