The basic import steps described for tiny tables applies to importing bigger tables into Impala. The difference occurs when you construct your sqoop import command. For large tables, you want it to run fast, so setting parallelism to 1, which specifies one map task during the import won’t work well. Instead, using the default parallelism setting, which is 4 map tasks to import in parallel, is a good place to start. So you don’t need to specify a value for the -m option unless you want to increase the number of parallel map tasks.
Another difference is that bigger tables usually have a primary key, which become good candidates where you can split the data without skewing it. The tiny_table we imported earlier doesn’t have a primary key. Also note that the -e option for the sqoop import command, which instructs Sqoop to import the data returned for the specified SQL statement doesn’t work if you split data on a string column. If stringcolumns are used to split the data with the -e option, it generates incompatible SQL. So if you decide to split data on the primary key for your bigger table, make sure the primary key is on a column of a numeric data type, such as int, which works best with the -e option because it generates compatible SQL.
Read the whole thing. Sqoop has been around for a while because it does its job well.