First, it’s interesting to note that the Polybase engine uses “pdw_user” as its user account. That’s not a blocker here because I have an open door policy on my Hadoop cluster: no security lockdown because it’s a sandbox with no important information. Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129. These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv. Then, the Polybase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission. Then…nothing. It waits for 20-30 seconds and tries again. After four failures, it gives up. This is why it’s taking about 90 seconds to return an error message: it tries four times.
Aside from this audit log, there was nothing interesting on the Hadoop side. The YARN logs had nothing in them, indicating that whatever request happened never made it that far.
Here’s hoping there’s a solution in the future.
First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.
One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.
This is definitely something to keep in mind. I haven’t dealt with data with newlines within attributes, so I haven’t run into this yet, but don’t let it bite you.
Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop. That’s hardly a fair setup for a distributed processing system. Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible. This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important. I’m using the same hardware in all three cases, so in that sense this is a fair comp.
Despite my hemming and hawing, Polybase still performed as well as Hive and kicked sand in the linked server’s face. I have several ideas for how to tune and want to continue down this track, showing various ways to optimize Polybase and Hive queries.
As soon as I kick this off, I get an error:
Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.
Well, that’s not good… Checking sys.dm_exec_compute_node_errors gives me four error rows with stack trace results, none of which seems very useful to me (as opposed to a Microsoft support tech).
I don’t have any good answers in this blog post, so I’m hoping to learn more and report back later.
One of the most interesting use cases of Polybase is the ability to store historical data from relational databases into a Hadoop File System. The storage costs could be reduced while keeping the data accessible and still can be joined with the regular relational tables. So let`s do the first steps for our new archiving solution.
Archival is a very good use case for external table insertion, and if you don’t have a Hadoop cluster, you could insert into Azure blob storage.
Let’s walk through this one step at a time and understand what the DMV is telling us. Unfortunately, the DMV documentation is a little sparse, so some of this is guesswork on my part.
A RandomIDOperation appears to create a temporary table. In this case, the table (whose name is randomly generated) is named TEMP_ID_53. I’m not sure where that name comes from; the session I ran this from was 54, so it wasn’t a session ID.
After the table gets created, each Compute node gets told to create a table called TMP_ID_53 in tempdb whose structure matches our external table’s structure. One thing you can’t see from the screenshot is that this table is created with DATA_COMPRESSION = PAGE. I have to wonder if that’d be the same if my Compute node were on Standard edition.
We add an extended property on the table, flagging it as IS_EXTERNAL_STREAMING_TABLE.
We then update the statistics on that temp table based on expected values. 629 rows are expected here.
Then, we create the dest stat, meaning that the temp table now has exactly the same statistics as our external table.
The next step is that the Head node begins a MultiStreamOperation, which tells the Compute nodes to begin working. This operator does not show up in the documentation, but we can see that the elapsed time is 58.8 seconds, which is just about as long as my query took. My guess is that this is where the Head node passes code to the Compute nodes and tells them what to do.
We have a HadoopRoundRobinOperation on DMS, which stands for “Data Movement Step” according to the location_type documentation. What’s interesting is that according to the DMV, that operation is still going. Even after I checked it 40 minutes later, it still claimed to be running. If you check the full query, it’s basically a SELECT * from our external table.
Next is a StreamingReturnOperation, which includes our predicate WHERE dest = ‘ORD’ in it. This is a Data Movement Step and includes all of the Compute nodes (all one of them, that is) sending data back to the Head node so that I can see the results.
Finally, we drop TEMP_ID_53 because we’re done with the table.
This post was about 70% legwork and 30% guesswork. That’s a bit higher a percentage than I’d ideally like, but there isn’t that much information readily available yet, so I’m trying (in my own small way) to fix that.
Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table). We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.
Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats. The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.
The result isn’t very surprising in retrospect, and it’s good that “stats are stats are stats” is the correct answer.
Any Select query fails with the following error.
Msg 106000, Level 16, State 1, Line 1
Java heap space
Illegal input may cause the java out of memory error. In this particular case the file was not in UTF8 format. DMS tries to read the whole file as one row since it cannot decode the row delimiter and runs into Java heap space error.
Convert the file to UTF8 format since PolyBase currently requires UTF8 format for text delimited files.
I imagine that this page will get quite a few hits over the years, as there currently exists limited information on how to solve these issues if you run into them, and some of the error messages (especially the one quoted above) have nothing to do with root causes.
Polybase offers the ability to create statistics on tables, the same way that you would on normal tables. There are a few rules about statistics:
Stats are not auto-created. You need to create all statistics manually.
Stats are not auto-updated. You will need to update all statistics manually, and currently, the only way you can do that is to drop and re-create the stats.
When you create statistics, SQL Server pulls the data into a temp table, so if you have a billion-row table, you’d better have the tempdb space to pull that off. To mitigate this, you can run stats on a sample of the data.
Round one did not end on a high note, so we’ll see what round two has to offer.
An interesting thing about FIELD_TERMINATOR is that it can be multi-character. MSDN uses ~|~ as a potential delimiter. The reason you’d look at a multi-character delimiter is that not all file formats handle quoted identifiers—for example, putting quotation marks around strings that have commas in them to indicate that commas inside quotation marks are punctuation marks rather than field separators—very well. For example, the default Hive SerDe (Serializer and Deserializer) does not handle quoted identifiers; you can easily grab a different SerDe which does offer quoted identifiers and use it instead, or you can make your delimiter something which is guaranteed not to show up in the file.
You can also set some defaults such as date format, string format, and data compression codec you’re using, but we don’t need those here. Read the MSDN doc above if you’re interested in digging into that a bit further.
It’s a bit of a read, but the end result is that we can retrieve data from a Hadoop cluster as though it were coming from a standard SQL Server table. This is easily my favorite feature in SQL Server 2016.