Polybase In Azure SQL Data Warehouse

Simon Whiteley loves Polybase as much as I do:

“Polybase is by far the fastest way to import/export data from SQLDW. If you’re loading any reasonably sized data volume, you should be using Polybase”

That’s not a quote – I just thought it would be more impactful looking like one!

For those of a traditional “Big Data” background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.

For everyone else – Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files – even those in sub-folders. For me, this is mind-bogglingly cool – you can query thousands of files at once by typing “select * from dbo.myexternaltable”.

Simon also covers limitations in Polybase:

Push-down predicates

This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:

SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016

This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.

It’s an interesting look at Polybase, focusing on Azure SQL Data Warehouse.

Related Posts

Polybase Rejected Row Location

Casey Karst announces a nice improvement to Polybase on Azure SQL Data Warehouse: Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy […]

Read More

Troubleshooting Polybase Installation Errors

John Paul Cook has an article looking at what to do when you have Oracle JRE 9 and want to use Polybase: This article shows you how to troubleshoot a failed installation of SQL Server and how to implement a workaround to allow SQL Server 2017’s PolyBase feature to be installed when version 9 of […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930