Polybase Execution Plan With Blob Storage

I look at an execution plan and packet capture of a Polybase query which reads from Azure Blob Storage:

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

The upshot is that Polybase behaves very similarly on Azure Blob Storage as it does with on-prem Hadoop for non-MapReduce queries.

Related Posts

PolyBase — SQL to SQL

I have a post covering PolyBase from SQL Server to SQL Server: Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other […]

Read More

Memory Pressure and Azure SQL Managed Instances

Jovan Popovic takes us through determining whether we have enough memory on an Azure SQL Managed Instance: Managed Instance has memory that is proportional to the number of cores. As an example, in Gen5 architecture you have 5.1GB of memory per vCore, meaning that 8-core instance will have 41GB memory. Periodically you should check is […]

Read More


December 2016
« Nov Jan »