Using Sqoop To Transfer Data

Kevin Feasel

2016-07-01

Hadoop

Sai Sriparasa shows how to use Sqoop to transfer data from a Hadoop cluster into a relational database:

Here are a few best practices for exporting with Sqoop:

  • Options file—As commands with Sqoop export and Sqoop import tend to be bigger in size, I recommend storing the commands in an options file. By keeping it in an options file, you can even make it part of a version control pipeline to monitor changes to the command.

  • Field termination—With Sqoop export, I recommend providing field termination metadata using the “--fields-terminated-by” option. Also, other formatting options such as “lines-terminated-by”, “enclosed-by”, “escaped-by”, etc., can be used as required.

  • Mapper tuning—When an export job is submitted, Sqoop creates a Java class and submits a MapReduce job based on input splits; then, each mapper connects to the database to export the data. The default number of mappers is 4, so I recommend tuning the number of mappers depending on the availability of processors on the cluster. Too many mappers might cause the load to increase on the database. We recommend that you monitor the number of connections and keep track of processlist on MySQL.

  • Staging table—The Sqoop export job is broken down into multiple transactions based on the mappers. Each transaction is therefore atomic and does not have any dependencies on other transactions. I recommend using the “--staging-table” option, which acts as the buffer table for the separate transactions. After all transactions have been committed, a single transaction moves the data to the final destination. Use the “--clear-staging-table” option to clean up the staging table after the export job.
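
To see how those pieces fit together, here is a minimal sketch of an export options file and its invocation; the connection string, table names, and HDFS path are placeholders for illustration, not anything from Sai's post:

    # export-options.txt -- Sqoop options files take one argument per line;
    # lines starting with # are comments.
    export
    --connect
    jdbc:mysql://mydb.example.com:3306/sales
    --username
    sqoop_user
    --table
    orders
    --export-dir
    /user/hadoop/orders
    --fields-terminated-by
    ,
    --num-mappers
    8
    --staging-table
    orders_staging
    --clear-staging-table

With the command stored that way, the file can sit in version control and the job itself is just:

    sqoop --options-file export-options.txt

One caveat: Sqoop does not create the staging table for you, so it has to exist ahead of time with the same structure as the target table.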

There’s a lot in here which is Amazon-specific, and there are a couple of things you’d have to change to deploy against SQL Server, but there’s plenty of useful information here. I like that Sai shows how to use the Hadoop credential API instead of doing something silly like saving your password in plaintext.
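
For reference, that credential API approach generally looks like the sketch below: the password goes into a JCEKS keystore once, and the Sqoop job (on a version new enough to support --password-alias) refers to it by alias rather than passing --password on the command line. The alias, keystore path, and connection details here are made up for illustration.

    # Store the database password in a JCEKS provider on HDFS (prompts for the value).
    hadoop credential create sales.db.password \
        -provider jceks://hdfs/user/hadoop/sales.db.jceks

    # Point Sqoop at the provider and reference the alias instead of a plaintext password.
    sqoop export \
        -Dhadoop.security.credential.provider.path=jceks://hdfs/user/hadoop/sales.db.jceks \
        --connect jdbc:mysql://mydb.example.com:3306/sales \
        --username sqoop_user \
        --password-alias sales.db.password \
        --table orders \
        --export-dir /user/hadoop/orders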

