Spatial Functions In Hive

Constantin Stanca has a couple of posts on using Hive to implement geospatial queries.  First, an overview:

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. This can be used as a standalone library, or combined with Spatial Framework for Hadoop to create a SQL like experience.

The Spatial Framework for Hadoop includes among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDF’s could be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase etc.

He follows that up with some pieces hive misses compared to SQL Server, Oracle, etc.:

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMS like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% missing functions is to contribute to ESRI’s open source repository: ESRI does not provide a commercial library for Hive including all spatial functions.

Be sure to check out that second link to get an understanding of exactly what’s missing.  Via Mark Herring.

Related Posts

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark: There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and […]

Read More

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow: Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info. Databricks Runtime 5.3 and 5.3 ML and above support […]

Read More


August 2016
« Jul Sep »