Polybase External Data Source To Hadoop

I take a look at connecting to a Hadoop cluster for Polybase:

There are a couple of things I want to point out here.  First, the Type is HADOOP, one of the three types currently available:  HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).

Second, the Location is my name node on port 8020.  If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs.  In the Advanced tab, you can see the name node:

There are different options available for different sources, but this post is focused on Hadoop.
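
To make that concrete, here is a minimal sketch of the external data source, assuming the Hortonworks sandbox host name above and the name node port from Ambari (the RESOURCE_MANAGER_LOCATION line is only needed if you want predicate pushdown, and the port varies by distribution):

CREATE EXTERNAL DATA SOURCE HDP
WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
    -- , RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050'  -- optional; enables pushdown, so check your cluster's YARN settings
);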

Configuring Polybase

I have a post on setting up MapReduce in Polybase:

The short answer is, I'd get errors like the following when I tried to run a MapReduce job:

Log Type: stderr
Log Upload Time: Thu Oct 27 00:16:23 +0000 2016
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

This was a rather vexing issue for me for quite a while.
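
As a general configuration note (standard Polybase setup rather than the specific fix from the post), the instance-level hadoop connectivity setting has to match your distribution before pushdown works at all, and the usual guidance for MapReduce pushdown is to copy the cluster's yarn.application.classpath value into the yarn-site.xml under the Polybase installation's Hadoop conf directory.  A hedged sketch of the sp_configure side:

EXEC sp_configure 'hadoop connectivity', 7;  -- 7 maps to Hortonworks HDP 2.x on Linux plus Azure blob storage in SQL Server 2016; check the PolyBase Connectivity Configuration docs for your version
GO
RECONFIGURE;
GO
-- Restart SQL Server (and the two Polybase services) for the change to take effect.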

Installing Polybase

I have a post on installing Polybase:

Java gets updated due to security vulnerabilities approximately once every three days, so I won’t link to any particular version.  You only need to get the Java Runtime Environment (JRE), not the Java Development Kit (JDK).  Anyhow, once you have that installed, you can safely install SQL Server.

In the Polybase configuration section, you have the option of making this a standalone Polybase instance or enlisting it as part of a scale-out group.  In my case, I want to leave this as a standalone Polybase machine:  it is not on a Windows domain, and you need domain accounts for Polybase scale-out to work correctly.  Later in the series, we’ll give multi-node Polybase a shot.

This covers only the easiest installation scenario, but it’s a start.
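
Once setup finishes, a quick sanity check (a standard server property rather than anything specific to the linked post) confirms that the feature actually made it in:

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;  -- returns 1 when Polybase is installed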

Polybase As Ersatz StretchDB

Ginger Grant has a great idea:

PolyBase, which was released with SQL Server 2016, provides another method to access live data either locally or in the cloud, very similar to the SQL Server Stretch database feature. Polybase can also provide more cost-effective availability for cold data, streamline on-premises data maintenance, and keep data secure even during migration. Polybase differs from Stretch database in a few ways: the SQL must be different, the speed is noticeably slower, and it is a lot less expensive. The cost is significantly less because storing data in an Azure blob store starts at 1 cent a month, while Stretch database starts at $2.50 an hour. In this post, I will show how to take data which was archived due to its age (it was created in 2012) and store it in an Azure Blob Storage file which will be available via Polybase whenever I need it.

The ideal scenario for this solution is extremely cold data which is nonetheless required as part of regulatory compliance, where having a query run for 3 hours once every six months or so is acceptable.
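
To sketch out the plumbing behind this pattern (the credential, container, and storage account names here are placeholders rather than anything from Ginger’s post, and the database needs a master key before the credential will create):

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'archiveuser',  -- for blob storage the identity string is arbitrary; the secret is the storage account access key
     SECRET = '<storage account access key>';
GO
CREATE EXTERNAL DATA SOURCE ColdArchive
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://archive@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO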

Cloudera, Polybase, And Active Directory

Ajay Jagannathan shows how to integrate a SQL Server instance + Polybase with a Cloudera Hadoop cluster, all using Active Directory for accounts:

For all usernames and principals, we will use suffixes like Cluster14 for name scalability.

  1. Active Directory setup:
  • Create a new Organizational Unit for Hadoop users in AD, say (OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM).
  • Create an hdfs superuser: [email protected]
  • Cloudera Manager requires an Account Manager user that has privileges to create other accounts in Active Directory. You can use the Active Directory Delegate Control wizard to grant this user permission to create other users by checking the option to “Create, delete and manage user accounts”. Create a user [email protected] in OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM as an Account Manager.
  2. Install the OpenLDAP utilities (openldap-clients on RHEL/CentOS) on the Cloudera Manager server host. Install the Kerberos client (krb5-workstation on RHEL/CentOS) on all hosts in the cluster. This step requires an internet connection on those servers; if there is none, you can download the rpms and install them manually.

This is absolutely worth the read.
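
On the SQL Server side, the piece that ties into all of this Active Directory work is a database scoped credential holding a Kerberos account, referenced by the external data source.  A minimal sketch with made-up names (the core-site.xml and related files under the Polybase conf directory also need the cluster’s Kerberos settings):

CREATE DATABASE SCOPED CREDENTIAL ClouderaKerberosCredential
WITH IDENTITY = '<kerberos user>@CONTOSO.COM',  -- hypothetical principal
     SECRET = '<kerberos password>';
GO
CREATE EXTERNAL DATA SOURCE ClouderaCluster
WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://<name node host>:8020',
    CREDENTIAL = ClouderaKerberosCredential
);
GO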

Use Folders With Polybase

Andrew Peterson argues that you should use folders instead of individual files when creating external tables:

Why?
1) Add more files to the directory, and Polybase External table will automagically read them.
2) Do INSERTS and UPDATES from PolyBase back to your files in Hadoop.
(See PolyBase – Insert data into a Hadoop Hue Directory and PolyBase – Insert data into new Hadoop Directory.)
3) It’s cleaner.

This is good advice.  Also, if you’re using some other process to load data—for example, a map-reduce job or Spark job—you might have many smaller file chunks based on what the reducers spit out.  It’s not a bad idea to cat those file chunks together, but at least if you use a folder for your external data location, your downstream processes will still work as expected regardless of how the data is laid out.
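
Here is a hedged sketch of the folder-based approach, assuming an external data source named HDP and a delimited file format like the CsvFileFormat sketched further down already exist; the column list is purely illustrative:

CREATE EXTERNAL TABLE dbo.SecondBasemen
(
    FirstName NVARCHAR(50),
    LastName NVARCHAR(50),
    Age INT
)
WITH
(
    LOCATION = '/tmp/ootp/',        -- point at the folder rather than a single file like /tmp/ootp/secondbasemen.csv
    DATA_SOURCE = HDP,
    FILE_FORMAT = CsvFileFormat
);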

Polybase With HDP 2.5

I run into some issues with Polybase and Hortonworks Data Platform 2.5:

First, it’s interesting to note that the Polybase engine uses “pdw_user” as its user account.  That’s not a blocker here because I have an open door policy on my Hadoop cluster:  no security lockdown because it’s a sandbox with no important information.  Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129.  These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv.  Then, the Polybase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission.  Then…nothing.  It waits for 20-30 seconds and tries again.  After four failures, it gives up.  This is why it’s taking about 90 seconds to return an error message:  it tries four times.

Aside from this audit log, there was nothing interesting on the Hadoop side.  The YARN logs had nothing in them, indicating that whatever request happened never made it that far.

Here’s hoping there’s a solution in the future.

Polybase Data Loading

Meagan Longoria explains that loading data using Polybase can be finicky:

First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.

One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.

This is definitely something to keep in mind.  I haven’t dealt with data with newlines within attributes, so I haven’t run into this yet, but don’t let it bite you.
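
For reference, here is a hedged sketch of a delimited external file format.  You can set the field and string delimiters, but note the absence of any row terminator option, which is exactly the limitation Meagan describes:

CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = TRUE   -- there is no ROW_TERMINATOR option to go along with these
    )
);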

The Benefits Of Polybase

I take a look at running a Hadoop query against a big(gish) data set:

Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop.  That’s hardly a fair setup for a distributed processing system.  Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible.  This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important.  I’m using the same hardware in all three cases, so in that sense this is a fair comp.

Despite my hemming and hawing, Polybase still performed as well as Hive and kicked sand in the linked server’s face.  I have several ideas for how to tune and want to continue down this track, showing various ways to optimize Polybase and Hive queries.

Forcing Predicate Pushdown

I have a blog post on some troubles I’ve had with the FORCE EXTERNALPUSHDOWN hint:

As soon as I kick this off, I get an error:

Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.

Well, that’s not good…  Checking sys.dm_exec_compute_node_errors gives me four error rows with stack trace results, none of which seems very useful to me (as opposed to a Microsoft support tech).

I don’t have any good answers in this blog post, so I’m hoping to learn more and report back later.
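
For context, the hint and the follow-up DMV check look roughly like this; the table and predicate are illustrative rather than the actual query from the post:

SELECT COUNT(*)
FROM dbo.SecondBasemen sb
WHERE sb.Age > 30
OPTION (FORCE EXTERNALPUSHDOWN);   -- the hint that triggers Msg 7320 above

-- When it fails, the Polybase DMVs are the first place to look:
SELECT *
FROM sys.dm_exec_compute_node_errors;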
