Leveraging Hive In PySpark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext, which is a variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially we do not have metastore_db, but after we instantiate SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute a create database command, spark-warehouse is created.
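
To make that concrete, here is a minimal sketch of a Hive-enabled SparkSession on Spark 2.x or later; the application name and database name are placeholders, not taken from the original post:

```python
# Minimal sketch: SparkSession with Hive support (Spark 2.x+).
# Without a hive-site.xml, Spark creates a local metastore_db
# directory in the current working directory on first use.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")   # placeholder app name
         .enableHiveSupport()
         .getOrCreate())

# Running a create database command is what produces the
# spark-warehouse directory alongside metastore_db.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")  # demo_db is hypothetical
spark.sql("SHOW DATABASES").show()
```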

Click through for a bunch of examples.
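
For reference, on Spark versions before 2.0 the equivalent path goes through HiveContext rather than SparkSession, as the quoted passage notes. A rough sketch, again with a placeholder application name:

```python
# Pre-2.0 sketch: Hive access via HiveContext instead of SparkSession.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-example")  # placeholder app name
hive_context = HiveContext(sc)

# Hive-backed SQL runs through the context object.
hive_context.sql("SHOW DATABASES").show()
```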
