MERGE In Hive

Carter Shanklin introduces the MERGE operator in Hive:

USE CASE 2: UPDATE HIVE PARTITIONS.

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy, you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a third party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, records would suddenly be in the wrong partition and need to be cleaned up.
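To make the partition cleanup concrete, here is a rough sketch of how a single MERGE statement might handle it. It is not taken from the excerpt above. The table names (customer_partitioned, customer_updates), the columns (id, email, state, signup), and the INT type of id are all assumptions for illustration. It also assumes the target is a transactional (ACID) table partitioned by signup. Because Hive does not let an UPDATE change a partition column, a moved record is handled as a delete from the old partition plus an insert into the new one, which is why the source feeds each moved row in twice.

```sql
-- Hypothetical schema: customer_partitioned(id INT, email STRING, state STRING)
-- PARTITIONED BY (signup DATE), stored as a transactional table.
-- customer_updates is an assumed staging table with the corrected rows.

-- The INSERT branch writes to partitions chosen at run time.
SET hive.exec.dynamic.partition.mode=nonstrict;

MERGE INTO customer_partitioned AS t
USING (
  -- Branch 1: every staged row, keyed by id. Rows whose signup date
  -- changed are flagged so the stale copy can be deleted.
  SELECT
    CASE WHEN u.signup <> c.signup THEN 1 ELSE 0 END AS delete_flag,
    u.id AS match_key,
    u.id, u.email, u.state, u.signup
  FROM customer_updates u
  LEFT JOIN customer_partitioned c ON u.id = c.id

  UNION ALL

  -- Branch 2: the same moved rows again, with a NULL key so they never
  -- match and therefore fall through to the INSERT clause.
  SELECT
    0 AS delete_flag,
    CAST(NULL AS INT) AS match_key,
    u.id, u.email, u.state, u.signup
  FROM customer_updates u
  JOIN customer_partitioned c ON u.id = c.id
  WHERE u.signup <> c.signup
) s
ON t.id = s.match_key
-- Signup date changed: remove the copy sitting in the wrong partition.
WHEN MATCHED AND s.delete_flag = 1 THEN DELETE
-- Same partition: update the non-partition columns in place.
WHEN MATCHED AND s.delete_flag = 0 THEN UPDATE SET email = s.email, state = s.state
-- New customers and re-homed rows: insert, partition column last.
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.email, s.state, s.signup);
```

The sketch assumes id is unique in the staging table, so each target row matches at most one source row.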

It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse. MERGE is one more step in that direction.

