Using Sqoop To Move Data To Hadoop

Kevin Feasel

2017-07-18

Hadoop

The folks at Redglue have a few hints on using Sqoop to move data from a relational database to Hadoop:

  • “Data gets updated” problem

Data gets updated many times, and loading data with Sqoop is not a single event, as the data you are importing can be updated (INSERTed, DELETEd, or UPDATEd). What is important here is that HDFS is an "append-only filesystem" (exceptions made for HBase and Hive with ACID, but they are mostly tricks), so the options are pretty simple: replace the dataset, add data to the dataset (a partition, for example), or merge the old and new datasets.

If the data that you are loading is a small dataset, don’t think twice, replace and overwrite it.

If the data that you are loading is a big dataset, an "incremental" load is recommended. This can be a little tricky, as Sqoop needs to know what modifications were done since the last incremental or full import.
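As a rough sketch (not taken from the Redglue post), an incremental "lastmodified" import in Sqoop might look like the following; the connection string, table name, and column names here are hypothetical:

    # Hypothetical example: pull only rows modified since the last run and
    # merge them into the existing HDFS dataset on the primary key.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/sales/orders \
      --incremental lastmodified \
      --check-column last_updated \
      --last-value "2017-07-17 00:00:00" \
      --merge-key order_id

    # For a small table, a full overwrite is simpler: add --delete-target-dir
    # and drop the --incremental options.

If you wrap this in a saved Sqoop job (sqoop job --create), Sqoop tracks the last-value watermark between runs so you don't have to pass it by hand.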

I’m not a huge fan of Sqoop and prefer to use my own ingest mechanisms, but it’s an easy way to get started.

