Spatial Functions In Hive

Constantin Stanca has a couple of posts on using Hive to implement geospatial queries.  First, an overview:

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. This can be used as a standalone library, or combined with Spatial Framework for Hadoop to create a SQL like experience.

The Spatial Framework for Hadoop includes among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDF’s could be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase etc.

He follows that up with some pieces hive misses compared to SQL Server, Oracle, etc.:

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMS like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% missing functions is to contribute to ESRI’s open source repository: ESRI does not provide a commercial library for Hive including all spatial functions.

Be sure to check out that second link to get an understanding of exactly what’s missing.  Via Mark Herring.

U-SQL Updates

Michael Rys shows off what has been added to the August release for Azure Data Lake Analytics and U-SQL:

As part of the Azure Data Lake Analytics and U-SQL August 1st refresh, we released a couple of new, long awaited features, are fixing some confusing syntax and fixed several bugs

Read on for more details.

Deploying SSAS Cubes Using Powershell

Richie Lee has a Powershell script to deploy an Analysis Services cube:

Recently someone got in touch with me about one of my earliest posts. I realised I had made a mistake on that post and had updated. But I also realised that the script itself is actually part of a PowerShell module. So wanting to make things as easy as possible, and being the helpful guy that I am, I have altered that example into a script which I am posting here.

It’s worth stressing that the errors that output to the console are fairly generic, and that there is a log file that will contain all the information if there are any errors.

Sadly, you still need to write/tweak the big glob of XML, it seems.

DNS Aliases

Drew Furgiuele shows us how to use CNAME records to give us easy aliases for servers hosting SQL Server:

When you connect to a SQL Server instance, you’re most likely connecting directly to the host name of the server running that instance. So for example, if the host name of my instance is SQLSERVER-A, then in my SSMS connection (or application) I probably type in a host name offully qualified domain name (FQDN) of a server. If I want to move a database from one server to another, or stand up a new server and move everything over to it, from now on I’ll need to make sure I type in the new host name. And for a DBA, this is fine. We mostly identify our servers by the hosts they run on.

Developers and users, on the other hand, don’t always think like that. They each probably only care about one or two databases on a given instance, and depending on their release cycle even something as simple as changing a connection string might need to be a scheduled change. So when databases move or you build a new server you not only need to ensure as little downtime as possible from a system perspective but with as little impact to a user’s system too. And you can do that with the help of your network team and your local domain name system: DNS.

I’ve had great experiences with CNAME records masking actual server names.  This was most relevant in an environment in which devs just couldn’t remember which X-Men character was the production SQL Server and  which was QA.

Reverse Engineering Databases Using Biml

Kevin Feasel



Bill Fellows shows how to get a database schema using Biml:

I’m at a new client and I needed an offline version of their operation data store, ODS, database schema. I don’t know what I was expecting, but it wasn’t 11,500 tables. :O That’s a lot. First up to bat was Visual Studio Database Projects. I clicked Import and you really have no options to winnow the list of items your importing down via Import. Ten minutes later, the import timed out on spatial indexes. Which wouldn’t be so bad except it’s an all or nothing operation with import.

Fair enough, I’ll use the Schema Comparison and only compare tables, that should make it less ugly. And I suppose it did but still, the operation timed out. Now what?

SSMS to the rescue. I right click on my database and select Generate Scripts and first off, I script everything but the tables. Which is amusing when you have 11.5k tables, checking and unchecking the table box causes it to spin for a bit. I generated a file for each object with the hope that if the operation goes belly up, I’ll at least have some progress. Round 1, Winner! I had all my views, procedures, functions, data types (don’t hate), all scripted out nice and neat. Round 2, I just selected tables. And failed.

As you’d expect, there’s not that many lines of Biml code, and yet it does the job.

Installing R Packages In SQL Server

Kevin Feasel



Tomaz Kastrun shows how to install packages in SQL Server R Services:

Julie Koesmarno made a great post on installing R packages. Please follow this post. Also Microsoft suggests the following way to install R packages on MSDN.

Since I wanted to be able to have packages installed directly from SQL Server Management Studio (SSMS) here is yet another way to do it. I have used xp_cmdshell to install any additional package for my R (optionally you can setEXECUTE AS USER).

This is a bit of a backdoor method, but it does work.

Access Control Basics

Robert Sheldon gives an introductory-level overview of the basics of logins, users, roles, and permissions:

You can think of a role as a type of container for holding one or more logins, users, or other roles, similar to how a Windows group can hold multiple individual and group accounts. This can make managing multiple principals easier when those principals require the same type of access to SQL Server. You can configure each role with permissions to specific resources, adding or removing logins and users from these roles as needed.

SQL Server supports three types of roles: server, database, and application. Server roles share the same scope as logins, which means they operate at the server level and pertain to the database engine as a whole. As a result, you can add only server-level principals to the roles, and you can configure the roles with permissions only to server-level securables, not database-level securables.

These help form the foundation of a secure instance, so it’s vital to understand these concepts.

Basic Powershell Snippets

Michael Bourgon has a few Powershell snippets he wrote this week:

I needed to parse out a Unix filename from a file.  The original “ls” looked like this:

-rw-rw-r– 1 psabcderfg data 646621 Jul 19 16:25 myfile16071901.Z

The unix command to parse it would
cat $fl |awk ‘{print $9}’

And the powershell equivalent is:
$newfiles = get-content $fl | foreach {$_.split(” “)[8]}
(the number is 8 instead of 9 because powershell arrays start at 0)

Check all of them out.

Powershell Stacks

Phil Factor shows how to use Powershell’s push and pop methods to create an expression evaluator:

I use stacks for writing expression analysers. Generally I like at least two stacks, probably up to four. They tend to be different sizes and may have other differences. If written as objects, the code becomes much cleaner and easier to read. Why do I write expression analysers? You might imagine that you would never need such a thing, but once you have one, reasons for using it just keep appearing. They are handy for parsing document-based hierarchical data structures, for parsing grammars and for creating Domain-Specific Languages (DSLs). A DSL is handy when you’re writing a complex application because it cuts down dependencies, and allows you to develop scripts for complex workflows without recompiling .

What I describe here is a cut-down version of what I use, just to illustrate the way of creating and using stacks. I extend this basic algorithm, originally from Dijstra’s shunting algorithm, into a complete language interpreter. All it needs is a third stack to implement block iterations and ‘while’ loops. Why not use PowerShell itself as your DSL? I’ve tried that, but my experience is that it is pretty-well impossible to eliminate script-injection attacks, and I also find that there isn’t quite enough control.

For a more prosaic usage of the stack in Powershell (as well as bash), you can push your current location, move to a new directory, perform some action in that new directory, and pop your old location off the stack to go back to where you were before.  This is particularly useful for those Powershell modules and cmdlets which leave you in a different directory from where you started.

Ingesting E-Mail Into Hadoop

Kevin Feasel



Jordan Volz and Stefan Salandy show how to feed e-mails into Hadoop for almost-immediate analysis:

In particular, compliance-related use cases centered on electronic forms of communication, such as archiving, supervision, and e-discovery, are extremely important in financial services and related industries where being “out of compliance” can result in hefty fines. For example, financial institutions are under regulatory pressure to archive all forms of e-communication (email, IM, social media, proprietary communication tools, and so on) for a set period of time. Once data has grown past its retention period, it can then be permanently removed; in the meantime, such data is subject to e-discovery requests and legal holds. Even outside of compliance use cases, most large organizations that are subject to litigation have some form of archive in place for purposes of e-discovery.

Traditional solutions in this area comprise various moving parts and can be quite costly and complex to implement, maintain, and upgrade. By using the Hadoop stack to take advantage of cost-efficient distributed computing, companies can expect significant cost savings and performance benefits.

In this post, as a simple example of this use case, I’ll describe how to set up an open source, real-time ingestion pipeline from the leading source of electronic communication, Microsoft Exchange.

Most of this post is about setting up the interconnections between Exchange and Apache James, and feeding data in.  It looks like this will be part 1 of a multi-part series.


May 2018
« Apr