Polybase In Azure SQL Data Warehouse

Simon Whiteley loves Polybase as much as I do:

“Polybase is by far the fastest way to import/export data from SQLDW. If you’re loading any reasonably sized data volume, you should be using Polybase”

That’s not a quote – I just thought it would be more impactful looking like one!

For those of a traditional “Big Data” background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.

For everyone else – Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files – even those in sub-folders. For me, this is mind-bogglingly cool – you can query thousands of files at once by typing “select * from dbo.myexternaltable”.

Simon also covers limitations in Polybase:

Push-down predicates

This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:

SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016

This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.

It’s an interesting look at Polybase, focusing on Azure SQL Data Warehouse.

Related Posts

Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster: If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark: import org.apache.spark.sql.functions._ spark.sql(""" SELECT INT(SUMLEV) AS SummaryLevel, INT(COUNTY) AS CountyID, INT(PLACE) AS PlaceID, BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography, NAME AS Name, POPTYPE AS PopulationType, […]

Read More

PolyBase on Linux

I have a post showing how to set up PolyBase on Linux: Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on. First, run docker […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930