Spark SQL For ETL

Kevin Feasel



Ben Snively discusses using Spark SQL as part of an ETL process:

Now interact with SparkSQL through a Zeppelin UI, but re-use the table definitions you created in the Hive metadata store.   You’ll create another table in SparkSQL later in this post to show how that would have been done there.

Connect to the Zeppelin UI and create a new notebook under the Notebook tab. Query to show the tables. You can see that the two tables you created in Hive are also available in SparkSQL.

There are a bunch of tools in here, but for me, the moral of the story is that SQL is a great language for data processing.  Spark SQL has gaps, but has filled many of those gaps over the past year or so, and I recommend giving it a shot.

Related Posts

When Spark Meets Hive

Anna Martin and Rosaria Silipo look at combining HiveQL and SparkQL: We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and […]

Read More

Warning When Using dplyr Mutate

John Mount has a warning if you are using dplyr’s mutate function and connecting to Spark or a database: If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() statement inside the same dplyr::mutate() statement. This has been my coding advice for some time, […]

Read More


May 2016
« Apr Jun »