Polybase DMVs

I look at the DMVs associated with Polybase and external table creation:

Let’s walk through this one step at a time and understand what the DMV is telling us.  Unfortunately, the DMV documentation is a little sparse, so some of this is guesswork on my part.

  1. A RandomIDOperation appears to create a temporary table.  In this case, the table (whose name is randomly generated) is named TEMP_ID_53.  I’m not sure where that name comes from; the session I ran this from was 54, so it wasn’t a session ID.

  2. After the table gets created, each Compute node gets told to create a table called TMP_ID_53 in tempdb whose structure matches our external table’s structure.  One thing you can’t see from the screenshot is that this table is created with DATA_COMPRESSION = PAGE.  I have to wonder if that’d be the same if my Compute node were on Standard edition.

  3. We add an extended property on the table, flagging it as IS_EXTERNAL_STREAMING_TABLE.

  4. We then update the statistics on that temp table based on expected values.  629 rows are expected here.

  5. Then, we create the dest stat, meaning that the temp table now has exactly the same statistics as our external table.

  6. The next step is that the Head node begins a MultiStreamOperation, which tells the Compute nodes to begin working.  This operator does not show up in the documentation, but we can see that the elapsed time is 58.8 seconds, which is just about as long as my query took.  My guess is that this is where the Head node passes code to the Compute nodes and tells them what to do.

  7. We have a HadoopRoundRobinOperation on DMS, which stands for “Data Movement Step” according to the location_type documentation.  What’s interesting is that according to the DMV, that operation is still going.  Even after I checked it 40 minutes later, it still claimed to be running.  If you check the full query, it’s basically a SELECT * from our external table.

  8. Next is a StreamingReturnOperation, which includes our predicate WHERE dest = ‘ORD’ in it.  This is a Data Movement Step and includes all of the Compute nodes (all one of them, that is) sending data back to the Head node so that I can see the results.

  9. Finally, we drop TEMP_ID_53 because we’re done with the table.

This post was about 70% legwork and 30% guesswork.  That’s a bit higher a percentage than I’d ideally like, but there isn’t that much information readily available yet, so I’m trying (in my own small way) to fix that.

Related Posts

Safely Dropping Databases

Bob Pusateri notes a little issue when it comes to dropping databases: At a previous employer, we had a well-defined process when dropping databases for a client. It went something like this: Confirm in writing the databases on which servers/instances to be dropped Take a final full backup of databases Take databases offline Wait at […]

Read More

Generating Load For Kafka With JMeter

Anup Shirolkar shows us a way to use JMeter to generate load for Apache Kafka clusters: The Anomalia Machina is going to require (at least!) one more thing as stated in the intro, loading with lots of data! Kafka is a log aggregation system and operates on a publish-subscribe mechanism. The Kafka cluster in Anomalia Machina […]

Read More

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930