Spark SQL For ETL

Kevin Feasel

2016-05-27

Spark

Ben Snively discusses using Spark SQL as part of an ETL process:

Now interact with SparkSQL through a Zeppelin UI, but re-use the table definitions you created in the Hive metadata store. You’ll create another table in SparkSQL later in this post to show how that would have been done there.

Connect to the Zeppelin UI and create a new notebook under the Notebook tab. Query to show the tables. You can see that the two tables you created in Hive are also available in SparkSQL.
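The "query to show the tables" step can be sketched outside of Zeppelin as well. The snippet below is a minimal illustration, not code from Ben's post: it assumes PySpark 2.0 or later with Hive support available, and the helper name `list_tables`, the app name, and the `default` database are hypothetical choices made for this example.

```python
def list_tables(spark, database="default"):
    """Return the sorted names of the tables Spark SQL can see in `database`.

    `spark` is expected to be a SparkSession created with Hive support, so
    tables defined in the Hive metastore show up here too.
    """
    rows = spark.sql(f"SHOW TABLES IN {database}").collect()
    # SHOW TABLES returns one row per table, with a `tableName` column.
    return sorted(row.tableName for row in rows)


def main():
    # Imported here so the helper above does not require Spark just to be loaded.
    from pyspark.sql import SparkSession

    # Assumption: the cluster is configured to use the same Hive metastore
    # in which the tables were originally created.
    spark = (
        SparkSession.builder
        .appName("list-hive-tables")  # hypothetical app name
        .enableHiveSupport()
        .getOrCreate()
    )
    for name in list_tables(spark):
        print(name)
    spark.stop()
```

Calling `main()` on a cluster that shares the Hive metastore should print the table names. In a Zeppelin notebook, the equivalent is typically a `%sql` paragraph containing `show tables`.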

There are a bunch of tools in here, but for me, the moral of the story is that SQL is a great language for data processing. Spark SQL still has gaps, but it has filled many of them over the past year or so, and I recommend giving it a shot.
