Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy using PolyBase. It is elastic, globally available, and leverages Massively Parallel Processing (MPP). In reality clean data is a luxury that is not always available. In those cases you need to know which rows failed to load and why.
In Azure SQL Data Warehouse the Create External Table definition has been extended to include a Rejected_Row_Location parameter. This value represents the location in the External Data Source where the Error File(s) and Rejected Row(s) will be written.
This is a big improvement, one that I hope to see on the on-prem product.
This article shows you how to troubleshoot a failed installation of SQL Server and how to implement a workaround to allow SQL Server 2017’s PolyBase feature to be installed when version 9 of the Java Runtime Environment (JRE) is present. An installation of all features in SQL Server 2017 has three external dependencies. Python, R, and the JRE are third party or open source software needed in a full installation. Changes to external software after the release of SQL Server 2017 can introduce breaking changes. Oracle, the company that owns Java, changed how Windows registry keys are named. This caused a breaking change for SQL Server 2017. Version 8 of the JRE is compatible with the SQL Server 2017 installer. Version 9 of the JRE is not. If version 9 of the JRE is the only version of the JRE on a Windows machine, it is not possible to install the PolyBase feature. The JRE version bug also is found in the SQL Server 2016 installer. The same workaround works for both SQL Server 2016 and 2017.
But now we run into a problem: there are certain ports which need to be open for Polybase to work. This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs. This goes back to the issue we see with spinning up data nodes in Docker: ports are not available. If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.
As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.
Supporting this configuration allows PolyBase to connect and query Hadoop clusters that have wire encryption turned on. This enables a secure connection between Hadoop and SQL Server; as well as, among the Hadoop Data Nodes.
To connect to a Hadoop cluster with the hadoop.rpc.protection set to privacy or integrity, you will need to alter the core-site.xml file that is installed with PolyBase. This file is generally found at C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf.
That’s good news for Polybase users.
Simon Whiteley continues his Polybase on Azure SQL Data Warehouse series. First, he covers data loading patterns:
That’s enough about data loading for now, there’s another major use case for Polybase that we haven’t yet discussed. Many data processing solutions have a huge, unwieldy overnight batch job that performs aggregates, lookups, analytics and various other calculations.
However, it is often the case that this is not timely enough for many business requirements. This is where Polybase can help.
If we have an External Table over the newest files, this will read these new records at query time. We can write a view that combines the External Table with our batch-produced table. This will obviously go a little slower than usual, given it has to read data from flat files each time, however the results returned will be up to date.
In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.
General best practice, therefore, would be write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we’re interested in here is the core data movement itself.
Writing a traditional INSERT statement isn’t the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, that is automatically configured for optimal bulk loading, this is the CTAS, or “Create Table as Select” statement.
This is a pair of interesting posts from Simon.
“Polybase is by far the fastest way to import/export data from SQLDW. If you’re loading any reasonably sized data volume, you should be using Polybase”
That’s not a quote – I just thought it would be more impactful looking like one!
For those of a traditional “Big Data” background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.
For everyone else – Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files – even those in sub-folders. For me, this is mind-bogglingly cool – you can query thousands of files at once by typing “select * from dbo.myexternaltable”.
This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:
SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016
This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.
It’s an interesting look at Polybase, focusing on Azure SQL Data Warehouse.
SQL Data Warehouse supports many loading methods, including SSIS, BCP, the SQLBulkCopy API, and Azure Data Factory (ADF). These methods all share a common pattern for data ingestion. By comparison, the PolyBase technology uses a different approach that provides better performance.
PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable, query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database platform ecosystems (APS only).
If you’re curious about Azure SQL Data Warehouse, this is a must-read.
For federated queries: “N” requires all data from the source to be copied into SQL Server 2016 and then filtered. For “Y”, the query is pushed down into the data source and only the results are returned back, which can be much faster for large amounts of data.
I mention “Maybe” for age out data in SQL DW as you can use PolyBase to access the aged-out data in blob or Azure Data Lake Storage (ADLS), but it will have to import all the data so may have slower performance (which is usually ok for accessing data that is aged-out). For SQL Server 2016, it will have to import the data unless you use HDP/Cloudera, in which case the creation of the MapReduce job will add overhead.
The thing that I like about this chart is that the new Polybase sources (SQL Server, Oracle, Teradata, Mongo, and generic ODBC) do support predicate pushdown. For large data sets, that’s huge: it lets the database engine on the opposite end do as much filtering as possible before sending results back to your SQL Server head node.
Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data into ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers is having to wait for a substantial time after the data was cooked to be able to explore it and gather insights. This was because the data stored in ADLS would have to be loaded into SQL Data Warehouse using tools row-by-row insertion. But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you will now be able to load and access the cooked data rapidly and lessen your time to start performing interactive analytics. PolyBase support will allow to you access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times or you use can the files in ADLS as external tables. So get ready to unlock the value stored in your petabytes of data stored in ADLS.
I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.
That’s a header row, and I’m okay with it not making its way in. As a quick aside, I should note that I picked tailnum as my distribution key. The airplane’s tail number is unique to that craft, so there absolutely will be more than 60 distinct values, and as I recall, this data set didn’t have too many NULL values. After loading the 2008 data, I loaded all years’ data the same way, except selecting from dbo.Flights instead of Flights2008.
Click through for more details, including the CETAS statement, which I’d love to see in on-prem SQL Server.