MERGE In Hive

Carter Shanklin introduces the MERGE operator in Hive:

USE CASE 2: UPDATE HIVE PARTITIONS.

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a 3rd-party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse.  This is one additional step in that direction.

Related Posts

Tips For Using PolyBase With Cloudera QuickStart VM

I have a post on using Cloudera’s QuickStart VM with PolyBase: Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini. Because PolyBase needs to have direct access to the data nodes, […]

Read More

Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates: This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds […]

Read More

Categories

August 2017
MTWTFSS
« Jul Sep »
 123456
78910111213
14151617181920
21222324252627
28293031