“Data gets updated” problem
Source data keeps changing, so loading data with Sqoop is not a single event: the rows you are importing can be INSERTed, UPDATEd or DELETEd after the first load. What matters here is that HDFS is an “append-only filesystem” (HBase and Hive with ACID are exceptions, but they are mostly tricks layered on top), so the options are pretty simple: replace the dataset, add data to the dataset (a new partition, for example) or merge the old and new datasets.
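To make the “merge” option concrete, here is a minimal Python sketch of what merging old and new data boils down to: the newest version of each row wins, matched on a key. The function name `merge_datasets` and the `id` key column are my own assumptions for illustration, and real datasets would not fit in memory like this. Note that deleted rows are not handled; the new data would have to flag them somehow.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def merge_datasets(old_rows: Iterable[Row], new_rows: Iterable[Row], key: str = "id") -> List[Row]:
    """Merge two datasets on a key column; rows from new_rows win.

    Hypothetical helper for illustration only. It covers INSERTs and
    UPDATEs, but not DELETEs, which never show up in new_rows.
    """
    # Index the old dataset by key.
    merged = {row[key]: row for row in old_rows}
    # New rows either overwrite an existing key (UPDATE) or add one (INSERT).
    for row in new_rows:
        merged[row[key]] = row
    return list(merged.values())


old = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
new = [{"id": 2, "name": "robert"}, {"id": 3, "name": "carol"}]
print(merge_datasets(old, new))
```

Row 2 is replaced by its newer version and row 3 is added, while row 1 is kept untouched.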
If the data that you are loading is a small dataset, don’t think twice: replace and overwrite it.
If the data that you are loading is a big dataset, an “incremental” load is recommended. This can be a little tricky, as Sqoop needs to know which modifications were made since the last incremental or full import.
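As a rough sketch of how that looks in practice, the Python helper below only assembles a `sqoop import` command line for an incremental load. The flags (`--incremental`, `--check-column`, `--last-value`, `--merge-key`) are Sqoop 1 options; the helper name `build_incremental_import`, the JDBC URL, the table and the column names are all made up for the example, so adapt them to your source.

```python
from typing import List, Optional


def build_incremental_import(
    connect: str,
    table: str,
    target_dir: str,
    mode: str,
    check_column: str,
    last_value: str,
    merge_key: Optional[str] = None,
) -> List[str]:
    """Build the argument list for an incremental `sqoop import`.

    Hypothetical helper: it does not run Sqoop, it only builds the command.
    mode is "append" (new rows only, check_column is an increasing id) or
    "lastmodified" (updated rows too, check_column is a timestamp).
    """
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    cmd = [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", mode,
        "--check-column", check_column,
        # Upper bound of the previous import; only newer rows are fetched.
        "--last-value", last_value,
    ]
    if mode == "lastmodified":
        if merge_key is None:
            raise ValueError("lastmodified mode needs a merge_key in this sketch")
        # Updated rows are merged into the existing data on this key.
        cmd += ["--merge-key", merge_key]
    return cmd


# Example with made-up connection details.
print(" ".join(build_incremental_import(
    connect="jdbc:mysql://db.example.com/sales",
    table="orders",
    target_dir="/data/raw/orders",
    mode="lastmodified",
    check_column="updated_at",
    last_value="2020-01-01 00:00:00",
    merge_key="order_id",
)))
```

The tricky part is keeping track of the last value between runs. A saved Sqoop job can store it for you; otherwise you have to persist it yourself.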
I’m not a huge fan of Sqoop and prefer to use my own ingest mechanisms, but it’s an easy way to get started.