Sqoop From MySQL To Cloudera

Kevin Feasel



Alan Choi and Laurel Hale show us how to use Sqoop to migrate data from MySQL into Impala:

The basic import steps described for tiny tables applies to importing bigger tables into Impala. The difference occurs when you construct your sqoop import command. For large tables, you want it to run fast, so setting parallelism to 1, which specifies one map task during the import won’t work well. Instead, using the default parallelism setting, which is 4 map tasks to import in parallel, is a good place to start. So you don’t need to specify a value for the -m option unless you want to increase the number of parallel map tasks.
Another difference is that bigger tables usually have a primary key, which become good candidates where you can split the data without skewing it. The tiny_table we imported earlier doesn’t have a primary key. Also note that the -e option for the sqoop import command, which instructs Sqoop to import the data returned for the specified SQL statement doesn’t work if you split data on a string column. If stringcolumns are used to split the data with the -e option, it generates incompatible SQL. So if you decide to split data on the primary key for your bigger table, make sure the primary key is on a column of a numeric data type, such as int, which works best with the -e option because it generates compatible SQL.

Read the whole thing. Sqoop has been around for a while because it does its job well.

Related Posts

Pivoting Spark DataFrames

Unmesha Sreeveni shows how we can pivot a DataFrame in Apache Spark using one line of code: A pivot can be thought of as translating rows into columns while applying one or more aggregations. Lets see how we can achieve the same using the above dataframe. We will pivot the data based on “Item” column. […]

Read More

Troubleshooting Spark Performance

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark: Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing […]

Read More


February 2019
« Jan Mar »