Using Sqoop To Transfer Data

Kevin Feasel



Sai Sriparasa shows how to use Sqoop to transfer data from a Hadoop cluster into a relational database:

Here are a few best practices for exporting with Sqoop:

  • Options file—As commands with Sqoop export and Sqoop import tend to be bigger in size, I recommend storing the commands in an options file. By keeping it in an options file, you can even make it part of a version control pipeline to monitor changes to the command.

  • Field termination—With Sqoop export, I recommend providing field termination metadata using the “–fields-terminated-by” option. Also, other formatting options such as “lines-terminated-by”, “enclosed-by”, “escaped-by”, etc., can be used as required.

  • Mapper tuning—When an export job is submitted, Sqoop creates a Java class and submits a MapReduce job based on input splits; then, each mapper connects to the database to export the data. The default number of mappers is 4, so I recommend tuning the number of mappers depending on the availability of processors on the cluster. Too many mappers might cause the load to increase on the database. We recommend that you monitor the number of connections and keep track of processlist on MySQL.

  • Staging table—The Sqoop export job is broken down into multiple transactions based on the mappers. Each transaction is therefore atomic and does not have any dependencies on other transactions. I recommend using the “–staging-table” option that acts as the buffer table for the separate transactions. After all transactions have been committed, a single transaction move is made to move the data to the final destination. Use the “–clear-staging-table” option to clean up the staging table after the export job.

There’s a lot in here which is Amazon-specific and there are a couple of things you’d have to change to deploy to SQL Server, but there’s a lot of useful information here.  I like that Sai shows how to use the Hadoop credential API instead of doing something silly like saving your password in plaintext.

Related Posts

A Simple Example With Spark And Kafka

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka: This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this […]

Read More

Talking To Secure Hadoop Clusters

Mubashir Kazia shows how to connect to a secured Hadoop cluster using Active Directory: The primary form of strong authentication used on a secure cluster is Kerberos. Kerberos supports credentials delegation where a server process to which a user has authenticated, can perform actions on behalf of the user. This involves the server process accessing […]

Read More


July 2016
« Jun Aug »