Even for a simple query, I’m not going to expect you to read 174 lines of XML; I’m not a sadist, after all…
What follows is a look at significant lines and my commentary.
Don’t listen to me there; that guy really is a sadist who wants you to read 174 lines of XML.
Once we did that and I restarted all of the services, I ended up getting an interesting error message from SQL Server:
Msg 7320, Level 16, State 110, Line 2
Cannot execute the query "Remote Query" against OLE DB provider "SQLNCLI11" for linked server "(null)". EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to JobSubmitter_SubmitJob: Error [org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=512
The error message is pretty clear: the Polybase service wants to create containers that are 1536 MB in size, but the maximum size I’m allowing is 512 MB. Therefore, the Polybase MapReduce operation fails.
Long story short, I needed enough RAM to be able to give 4 1/2 GB to YARN for creating MapReduce containers in order to run my query.
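If the cap really is YARN's scheduler maximum, the relevant settings live in yarn-site.xml. This is a sketch rather than the post's actual configuration; the values below assume roughly 4.5 GB (4608 MB) available for containers, which lines up with the requested 1536 MB container size:

```xml
<!-- yarn-site.xml: illustrative values, not the post's actual settings -->
<property>
  <!-- Total memory YARN may hand out on this node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4608</value>
</property>
<property>
  <!-- Largest single container request the scheduler will accept;
       must be >= the 1536 MB Polybase asks for -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4608</value>
</property>
<property>
  <!-- Smallest container the scheduler will allocate -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
```

With maxMemory raised above 1536, the InvalidResourceRequestException above should no longer fire.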
As a reminder, in order to allow predicate pushdown to occur, we need to hit a Hadoop cluster; we can’t use predicate pushdown on other systems like Azure Blob Storage. Second, we need to have a resource manager link set up in our external data source. Third, we need to make sure that everything is configured correctly on the Polybase side. But once you have those items in place, it’s possible to use the FORCE EXTERNALPUSHDOWN command like so:
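The post's actual query isn't reproduced in this excerpt, but the hint itself is just a query-level OPTION clause. A minimal sketch, with a hypothetical external table and predicate:

```sql
-- dbo.SecondBasemen is a hypothetical external table pointing at Hadoop
SELECT
    sb.FirstName,
    sb.LastName,
    sb.Age
FROM dbo.SecondBasemen sb
WHERE sb.Age > 30
OPTION (FORCE EXTERNALPUSHDOWN);
```

The mirror-image hint, DISABLE EXTERNALPUSHDOWN, prevents MapReduce job creation in the same way.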
There’s also discussion of preventing MapReduce job creation as well as a pushdown-related error I had received in the past.
Stream 2 sends along 27 MB worth of data. It’s packaging everything Polybase needs to perform operations, including JAR files and any internal conversion work that the Polybase engine might need to translate Hadoop results into SQL Server results. For example, the sqlsort at the bottom is a DLL that performs ordering using SQL Server-style collations, as per the Polybase academic paper.
Stream 2 is responsible for almost every packet from 43 through 19811; only 479 packets in that range belonged to some other stream (19811 – 43 – 18864 – 425 = 479). We send all of this data to the data node via port 50010.
If you love looking at Wireshark streams, you’ll love this post.
The dm_exec_external_work DMV tells us which execution we care about; in this case, I ended up running the same query twice, but I decided to look at the first run of it. Then, I can get step information from dm_exec_distributed_request_steps. This shows that we created a table in tempdb called TEMP_ID_14 and streamed results into it. The engine also created some statistics (though I’m not quite sure where it got the 24 rows from), and then we perform a round-robin query. Each Polybase compute node queries its temp table and streams the data back to the head node. Even though our current setup only has one compute node, the operation is the same as if we had a dozen Polybase compute nodes.
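A sketch of that DMV walk; the execution ID literal here is hypothetical, so substitute whichever execution_id the first query surfaces:

```sql
-- Find recent Polybase executions and pick the one we care about
SELECT w.execution_id, w.start_time, w.bytes_processed
FROM sys.dm_exec_external_work w
ORDER BY w.start_time DESC;

-- Walk that execution's distributed request steps, which is where
-- the TEMP_ID_xx tempdb table creation and round-robin query show up
SELECT s.step_index, s.operation_type, s.location_type, s.status, s.command
FROM sys.dm_exec_distributed_request_steps s
WHERE s.execution_id = 'QID1234'  -- hypothetical; use the id from above
ORDER BY s.step_index;
```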
Click through for Wireshark-related fun.
Polybase was first made available in Analytics Platform System in March 2013, and then in SQL Server 2016. The announcement at the PASS Summit was that, in a preview early next year, PolyBase will support Teradata, Oracle, SQL Server, and MongoDB in addition to Hadoop and Azure Blob Storage. Azure SQL Data Warehouse's PolyBase will also gain support for Azure Data Lake Store.
With SQL Server 2016, you can create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance (see PolyBase scale-out groups):
I’m excited for the future of Polybase and looking forward to vNext and vNext + 1 (for the stuff which they can’t possibly get done in time for vNext).
The DATA_SOURCE and DATA_FORMAT options are easy: pick your external data source and external file format of choice.
The last major section deals with rejection. We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict structural checks before inserting records. The Hadoop mindset is that there are two places in which you can perform data quality checks: in the original client (pushing data into HDFS) and in any clients reading data from HDFS. To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table. For example, let’s say that we have an Age column for each of our players, and that each age is an integer. If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail. Polybase rejects this row (removing it from the result set stream) and increments a rejection counter. What happens next depends upon the reject options.
Creating an external table is pretty easy once you have the foundation prepared.
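A sketch of what that looks like, with the reject options from above; the table, location, data source, and file format names are illustrative rather than the post's exact definitions:

```sql
-- Assumes an external data source (HDP) and file format (CsvFileFormat)
-- have already been created
CREATE EXTERNAL TABLE dbo.SecondBasemen
(
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50),
    Age       INT
)
WITH
(
    LOCATION = '/baseball/SecondBasemen.csv',
    DATA_SOURCE = HDP,
    FILE_FORMAT = CsvFileFormat,
    REJECT_TYPE = VALUE,   -- fail after an absolute number of bad rows...
    REJECT_VALUE = 5       -- ...in this case, five
);
```

With REJECT_TYPE = PERCENTAGE instead, REJECT_VALUE becomes a percentage of rows and REJECT_SAMPLE_VALUE controls how often the engine re-checks the rejection rate.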
The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108). In each case, the missing row was the first, meaning that when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows. When I search for another second baseman in the set, I get back four rows, exactly as I would have expected.
What’s really interesting is the result I get back from Wireshark when I run a query without pushdown: it does actually return the row for Casey Turgeon.
This isn’t an ideal scenario, but it did seem to be consistent in my limited testing.
This is a very interesting set of results. First, 7Zip archived files do not work with the default encoding. I’m not particularly surprised by this result, as 7Zip support is relatively scarce across the board and it’s a niche file format (though a very efficient format).
The next failure case is tar. Tar is a weird case because it missed the first row in the file but was able to collect the remaining 776 records. Same goes for .tar.gz. I unpackaged the .tar file and the constituent SecondBasemen.csv file did in fact have all 777 records, so it’s something weird about the codec.
Stick to BZip2 and GZip if you’re using flat files.
Delimited text is exactly as it sounds: you can use a comma, tab, pipe, tilde, or any other delimiter (including multi-character delimiters). So let’s go through the options here. First, FORMAT_TYPE must be DELIMITEDTEXT. From there, we have a few FORMAT_OPTIONS. I mentioned FIELD_TERMINATOR, which is how we separate the values in a record. We can also use STRING_DELIMITER if there are quotes or other markers around our string values.
DATE_FORMAT makes it easier for Polybase to understand how dates are formatted in your file. The MSDN document gives you hints on how to use specific date formats, but you can’t define a custom format today, or even use multiple date formats.
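Putting those options together, a sketch of a delimited-text file format; the object name, delimiter choices, and the GZip codec are illustrative:

```sql
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',          -- separates values in a record
        STRING_DELIMITER = '"',          -- markers around string values
        DATE_FORMAT = 'MM/dd/yyyy',      -- one format only; no custom/multiple formats
        USE_TYPE_DEFAULT = TRUE          -- fill missing values with type defaults
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
```

Note that DATE_FORMAT accepts exactly one format string, which is the limitation described above.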
It feels like there’s a new Hadoop file format every day.