The DATA_SOURCE and DATA_FORMAT options are easy: pick your external data source and external file format of choice.
The last major section deals with rejection. We’re going from a semi-structured system to a structured one, and sometimes there are bad rows in our data, as there are no strict structural checks before records are inserted. The Hadoop mindset is that there are two places where you can perform data quality checks: in the original client (pushing data into HDFS) and in any client reading data from HDFS. To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table. For example, let’s say that we have an Age column for each of our players, and that each age is an integer. If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail. Polybase rejects this row (removing it from the result set stream) and increments a rejection counter. What happens next depends upon the reject options.
Creating an external table is pretty easy once you have the foundation prepared.
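As a sketch of what that looks like, here is a minimal external table definition with reject options. The table, column, data source, and file format names are illustrative assumptions, not anything from a real system:

```sql
-- Assumes an external data source named HDP and a file format named CsvFileFormat
-- have already been created. REJECT_VALUE = 5 means the query fails once
-- Polybase has rejected more than five bad rows.
CREATE EXTERNAL TABLE dbo.SecondBasemen
(
    FirstName NVARCHAR(50),
    LastName NVARCHAR(50),
    Age INT
)
WITH
(
    LOCATION = '/user/data/SecondBasemen.csv',
    DATA_SOURCE = HDP,
    FILE_FORMAT = CsvFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);
```

With REJECT_TYPE = VALUE, the counter is an absolute row count; REJECT_TYPE = PERCENTAGE instead fails once the rejected percentage of rows sampled exceeds your threshold.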
The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108). In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows. When I search for another second baseman in the set, I get back four rows, exactly as I would have expected.
What’s really interesting is the result I get back from Wireshark when I run a query without pushdown: it does actually return the row for Casey Turgeon.
This isn’t an ideal scenario, but it did seem to be consistent in my limited testing.
This is a very interesting set of results. First, 7Zip archived files do not work with the default encoding. I’m not particularly surprised by this result, as 7Zip support is relatively scarce across the board and it’s a niche file format (though a very efficient one).
The next failure case is tar. Tar is a weird case because it missed the first row in the file but was able to collect the remaining 776 records. Same goes for .tar.gz. I unpackaged the .tar file and the constituent SecondBasemen.csv file did in fact have all 777 records, so it’s something weird about the codec.
Stick to BZip2 and GZip if you’re using flat files.
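To use one of the working compression codecs, you specify it in the external file format via DATA_COMPRESSION. This is a minimal sketch; the format name is an illustrative assumption:

```sql
-- A delimited-text format reading GZip-compressed flat files.
-- For BZip2, swap in 'org.apache.hadoop.io.compress.BZip2Codec'.
CREATE EXTERNAL FILE FORMAT CsvGzipFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ','),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
```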
Delimited text is exactly as it sounds: you can use a comma, tab, pipe, tilde, or any other delimiter (including multi-character delimiters). So let’s go through the options here. First, FORMAT_TYPE must be DELIMITEDTEXT. From there, we have a few FORMAT_OPTIONS. I mentioned FIELD_TERMINATOR, which is how we separate the values in a record. We can also use STRING_DELIMITER if there are quotes or other markers around our string values.
DATE_FORMAT makes it easier for Polybase to understand how dates are formatted in your file. The MSDN document gives you hints on how to use specific date formats, but you can’t define a custom format today, or even use multiple date formats.
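Putting those options together, a delimited-text file format might look like the following sketch. The format name, delimiters, and date format are illustrative assumptions:

```sql
-- Pipe-delimited file with quoted strings and US-style dates.
-- DATE_FORMAT must be one of the formats Polybase supports; you cannot
-- define your own, and one format applies to the whole file.
CREATE EXTERNAL FILE FORMAT PipeDelimitedFile
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '"',
        DATE_FORMAT = 'MM/dd/yyyy',
        USE_TYPE_DEFAULT = TRUE
    )
);
```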
It feels like there’s a new Hadoop file format every day.
There are a couple of things I want to point out here. First, the Type is HADOOP, one of the three types currently available: HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).
Second, the Location is my name node on port 8020. If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs. In the Advanced tab, you can see the name node:
There are different options available for different sources, but this post is focused on Hadoop.
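For a Hadoop source like the one described above, the external data source definition is short. This is a sketch assuming the Hortonworks sandbox name node from earlier; the data source name and the optional resource manager address are illustrative assumptions:

```sql
-- TYPE = HADOOP, pointing at the name node on port 8020.
-- RESOURCE_MANAGER_LOCATION is optional; it enables predicate pushdown
-- by telling Polybase where YARN's resource manager lives.
CREATE EXTERNAL DATA SOURCE HDP
WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050'
);
```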
The short answer is, I’d get errors like the following when I try to run a MapReduce job:
Log Type: stderr
Log Upload Time: Thu Oct 27 00:16:23 +0000 2016
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
This was a rather vexing issue for a long time for me.
Java gets updated due to security vulnerabilities approximately once every three days, so I won’t link to any particular version. You only need to get the Java Runtime Environment (JRE), not the Java Development Kit (JDK). Anyhow, once you have that installed, you can safely install SQL Server.
In the Polybase configuration section, you have the option of making this a standalone Polybase instance or enlisting it as part of a scale-out group. In my case, I want to leave this as a standalone Polybase machine, because I do not have this machine on a Windows domain, and you need domain accounts for Polybase scale-out to work correctly. Later in the series, we’ll give multi-node Polybase a shot.
This is the easiest installation scenario, but it’s a start.
PolyBase, which was released with SQL Server 2016, provides another method to access live data either locally or in the cloud, very similar to the SQL Server Stretch Database feature. Polybase can also provide more cost-effective availability for cold data, streamline on-premises data maintenance, and keep data secure even during migration. Polybase differs from Stretch Database in a few ways: the SQL must be different, the speed is noticeably slower, and it is a lot less expensive. The cost is significantly less because storing data in an Azure blob store starts at 1 cent a month, while Stretch Database starts at $2.50 an hour. In this post, I will show how to take data which was archived in 2012 due to its age and store it in an Azure Blob Storage file which will be available via Polybase when needed.
The ideal scenario for this solution is extremely cold data which is nonetheless required as part of regulatory compliance, where having a query run for 3 hours once every six months or so is acceptable.
For all usernames and principals, we will use suffixes like Cluster14 to keep names scalable.
- Active Directory setup:
- Create a new Organizational Unit for Hadoop users in AD, say OU=Hadoop,OU=CORP,DC=CONTOSO,DC=COM.
- Create an hdfs superuser: [email protected]
- Cloudera Manager requires an Account Manager user that has privileges to create other accounts in Active Directory. You can use the Active Directory Delegate Control wizard to grant this user permission to create other users by checking the option to “Create, delete and manage user accounts”. Create a user [email protected] in OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM as an Account Manager.
Install the OpenLDAP utilities (openldap-clients on RHEL/CentOS) on the Cloudera Manager server host. Install the Kerberos client (krb5-workstation on RHEL/CentOS) on all hosts of the cluster. This step requires an internet connection on the Hadoop servers; if there is no internet connection, you can download the RPMs and install them manually.
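On a RHEL/CentOS system, that installation step looks roughly like the following (package names as given above; sudo assumed):

```shell
# On the Cloudera Manager server host: OpenLDAP client utilities
sudo yum install -y openldap-clients

# On every host in the cluster: the Kerberos client tools
sudo yum install -y krb5-workstation
```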
This is absolutely worth the read.
1) Add more files to the directory, and Polybase External table will automagically read them.
2) Do INSERTS and UPDATES from PolyBase back to your files in Hadoop.
(See PolyBase – Insert data into a Hadoop Hue Directory and PolyBase – Insert data into new Hadoop Directory.)
3) It’s cleaner.
This is good advice. Also, if you’re using some other process to load data—for example, a map-reduce job or Spark job—you might have many smaller file chunks based on what the reducers spit out. It’s not a bad idea to cat those file chunks together, but at least if you use a folder for your external data location, your downstream processes will still work as expected regardless of how the data is laid out.
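As a quick illustration of that layout, here is a hypothetical HDFS session; the paths are illustrative assumptions, not from any of the posts above:

```shell
# A Spark or MapReduce job typically writes one file per reducer into a folder:
hdfs dfs -ls /user/sqlserver/secondbasemen
# e.g. part-00000, part-00001, part-00002, ...

# If the external table's LOCATION points at the folder rather than a single
# file, Polybase reads every file inside it, so the chunking doesn't matter.

# If you do want one physical file, getmerge concatenates the chunks locally:
hdfs dfs -getmerge /user/sqlserver/secondbasemen ./secondbasemen.csv
```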