I look at the DMVs associated with Polybase and external table creation:
Let’s walk through this one step at a time and understand what the DMV is telling us. Unfortunately, the DMV documentation is a little sparse, so some of this is guesswork on my part.
- A RandomIDOperation appears to create a temporary table. In this case, the table (whose name is randomly generated) is named TEMP_ID_53. I’m not sure where that name comes from; the session I ran this from was 54, so it wasn’t a session ID.
- After the table gets created, each Compute node gets told to create a table called TEMP_ID_53 in tempdb whose structure matches our external table’s structure. One thing you can’t see from the screenshot is that this table is created with DATA_COMPRESSION = PAGE. I have to wonder if that’d be the same if my Compute node were on Standard edition.
- We add an extended property on the table, flagging it as IS_EXTERNAL_STREAMING_TABLE.
- We then update the statistics on that temp table based on expected values. 629 rows are expected here.
- Then, we create the dest stat, meaning that the temp table now has exactly the same statistics as our external table.
- The next step is that the Head node begins a MultiStreamOperation, which tells the Compute nodes to begin working. This operation does not show up in the documentation, but we can see that the elapsed time is 58.8 seconds, which is just about as long as my query took. My guess is that this is where the Head node passes code to the Compute nodes and tells them what to do.
- We have a HadoopRoundRobinOperation on DMS, which stands for “Data Movement Step” according to the location_type documentation. What’s interesting is that according to the DMV, that operation is still going. Even when I checked it 40 minutes later, it still claimed to be running. If you check the full query, it’s basically a SELECT * from our external table.
- Next is a StreamingReturnOperation, which includes our predicate WHERE dest = 'ORD' in it. This is a Data Movement Step and includes all of the Compute nodes (all one of them, that is) sending data back to the Head node so that I can see the results.
- Finally, we drop TEMP_ID_53 because we’re done with the table.
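If you want to follow along, a query along these lines should return the same kind of step list walked through above. This is a minimal sketch, not necessarily the query behind my screenshot: it assumes the external table query is the most recent entry in sys.dm_exec_distributed_requests, and the second query against sys.dm_exec_dms_workers is just one way I'd expect to cross-check whether a DMS step (like that HadoopRoundRobinOperation) really is still running.

```sql
-- Sketch only: assumes the PolyBase query of interest is the most
-- recent distributed request. Change the filter if it is not.
DECLARE @execution_id NVARCHAR(32);

SELECT TOP (1) @execution_id = execution_id
FROM sys.dm_exec_distributed_requests
ORDER BY start_time DESC;

-- One row per step: the operation type, where it ran, and how long it took.
SELECT
    step_index,
    operation_type,
    location_type,
    [status],
    total_elapsed_time,   -- reported in milliseconds
    row_count,
    command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = @execution_id
ORDER BY step_index;

-- Per-worker detail for the DMS steps of the same request.
SELECT
    step_index,
    dms_step_index,
    compute_node_id,
    [type],
    [status],
    rows_processed,
    total_elapsed_time
FROM sys.dm_exec_dms_workers
WHERE execution_id = @execution_id
ORDER BY step_index, dms_step_index;
```

The command column in the first result set is where the CREATE TABLE, extended property, and statistics statements show up, so it’s worth widening that column or copying the value out to read the full text.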
This post was about 70% legwork and 30% guesswork. That’s more guesswork than I’d ideally like, but there isn’t that much information readily available yet, so I’m trying (in my own small way) to fix that.