Hive Data Ingestion In AWS

Kevin Feasel

2016-08-26

Hadoop

Songzhi Liu shows how to use the AWS stack to move data into Hive:

S3 bucket
In this framework, S3 is the start point and the place where data is landed and stored. You will configure the S3 bucket notifications as the event source that triggers the Lambda function. When a new object is stored/copied/uploaded in the specified S3 bucket, S3 sends out a notification to the Lambda function with the key information.

Lambda function
Lambda is a serverless technology that lets you run code without a server. The Lambda function is triggered by S3 as new data lands and then adds new partitions to Hive tables. It parses the S3 object key using the configuration settings in the DynamoDB tables.

DynamoDB table
DynamoDB is a NoSQL database (key-value store) service. It’s designed for use cases requiring low latency responses, as it provides double-digit millisecond level response at scale. DynamoDB is also a great place for metadata storage, given its schemaless design and low cost when high throughput is not required. In this framework, DynamoDB stores the schema configuration, table configuration, and failed actions for reruns.

EMR cluster
EMR is the managed Hadoop cluster service. In the framework, you use Hive installed on an EMR cluster.

This is a detailed post, but well worth a read if you’re on AWS.

Related Posts

Summarizing Improvements In Spark 2.4

Anmol Sarna summarizes Apache Spark 2.4 and pushes his meme game at the same time: The next major enhancement was the addition of a lot of new built-in functions, including higher-order functions, to deal with complex data types easier.Spark 2.4 introduced 24 new built-in functions, such as  array_union, array_max/min, etc., and 5 higher-order functions, such as transform, filter, etc.The entire […]

Read More

What’s New In Ambari 2.7

Paul Codding and Kat Petre share some of the new features in Ambari 2.7: With this release, we wanted to make Ambari more enjoyable to use every day, simplify finding and using our API, and unblock teams managing very large clusters.  Here is a preview of a few features we’re excited to share with you:Revamped […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031