Kevin Feasel



Carter Shanklin notes that Hive now has the ability to run MERGE statements:

As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex.  Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates are in progress.

This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.

This post mostly describes the gains rather than showing code, but it does show that Hive developers are looking at expanding beyond common Hadoop warehousing scenarios.
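
Since the linked post is light on code, here is a minimal Python sketch of what a MERGE does conceptually: one pass over a set of staged changes that updates matching rows, deletes rows flagged for removal, and inserts the rest. This is not Hive's syntax or implementation. The `merge` function, the `(id, name, op)` row shape, and the sample data are all hypothetical and chosen only for illustration. It also simplifies by assuming each id appears at most once in the staged changes.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical staged-change shape: (id, name, op), where op is "upsert" or "delete".
Change = Tuple[int, str, str]


def merge(target: Dict[int, str], changes: Iterable[Change]) -> Dict[int, str]:
    """Return a new table with every staged change applied.

    The input table is copied, so anyone still reading `target` never sees a
    half-applied batch. This loosely mirrors the isolation the quoted passage
    says hand-written update jobs have to build themselves.
    """
    result = dict(target)
    for row_id, name, op in changes:
        if row_id in result:
            if op == "delete":
                del result[row_id]       # matched and flagged: delete the row
            else:
                result[row_id] = name    # matched: update the row
        elif op != "delete":
            result[row_id] = name        # not matched: insert the row
    return result


customers = {1: "Ann", 2: "Bob", 3: "Cy"}
staged = [(2, "Robert", "upsert"), (3, "", "delete"), (4, "Dee", "upsert")]

merged = merge(customers, staged)
print(merged)
```

The point of the sketch is that update, delete, and insert all happen in one logical operation against one batch of changes, rather than as three separate jobs with their own rollback and conflict handling.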


