Polybase In Azure SQL Data Warehouse

Simon Whiteley loves Polybase as much as I do:

“Polybase is by far the fastest way to import/export data from SQLDW. If you’re loading any reasonably sized data volume, you should be using Polybase”

That’s not a quote – I just thought it would be more impactful looking like one!

For those of a traditional “Big Data” background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.

For everyone else – Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files – even those in sub-folders. For me, this is mind-bogglingly cool – you can query thousands of files at once by typing “select * from dbo.myexternaltable”.

Simon also covers limitations in Polybase:

Push-down predicates

This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:

SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016

This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.

It’s an interesting look at Polybase, focusing on Azure SQL Data Warehouse.

Related Posts

Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure: If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. […]

Read More

Keeping Polybase Tables In Sync With Biml

Ben Weissman combines two of my favorite things: If you have started playing with polybase, you probably figured out by now, that – as awesome as it is – it’s still a bit of a pain to set it up and maintain external tables. There is a wizard in Azure Data Studio but it’s still […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930