While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pro’s and con’s of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.
SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
James covers some of the reasoning behind this and the shift from using Polybase to integrate data with Hadoop + Azure Blob Storage to using SQL Server as a data virtualization engine.
SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.
SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.
Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.
Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.
I’ve wanted Spark integration ever since 2016 and we’re going to get it.
Break down data silos and deliver one view across all of your data using data virtualization. Starting in SQL Server 2016, PolyBase has enabled you to run a T-SQL query inside SQL Server to pull data from your data lake and return it in a structured format—all without moving or copying the data. Now in SQL Server 2019, we’re expanding that concept of data virtualization to additional data sources, including Oracle, Teradata, MongoDB, PostgreSQL, and others. Using the new PolyBase, you can break down data silos and easily combine data from many sources using virtualization to avoid the time, effort, security risks and duplicate data created by data movement and replication. New elastically scalable “data pools” and “compute pools” make querying virtualized data lighting fast by caching data and distributing query execution across many instances of SQL Server.
Just in time for me to scramble to update Polybase slides for Conference Season…
Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy using PolyBase. It is elastic, globally available, and leverages Massively Parallel Processing (MPP). In reality clean data is a luxury that is not always available. In those cases you need to know which rows failed to load and why.
In Azure SQL Data Warehouse the Create External Table definition has been extended to include a Rejected_Row_Location parameter. This value represents the location in the External Data Source where the Error File(s) and Rejected Row(s) will be written.
This is a big improvement, one that I hope to see on the on-prem product.
This article shows you how to troubleshoot a failed installation of SQL Server and how to implement a workaround to allow SQL Server 2017’s PolyBase feature to be installed when version 9 of the Java Runtime Environment (JRE) is present. An installation of all features in SQL Server 2017 has three external dependencies. Python, R, and the JRE are third party or open source software needed in a full installation. Changes to external software after the release of SQL Server 2017 can introduce breaking changes. Oracle, the company that owns Java, changed how Windows registry keys are named. This caused a breaking change for SQL Server 2017. Version 8 of the JRE is compatible with the SQL Server 2017 installer. Version 9 of the JRE is not. If version 9 of the JRE is the only version of the JRE on a Windows machine, it is not possible to install the PolyBase feature. The JRE version bug also is found in the SQL Server 2016 installer. The same workaround works for both SQL Server 2016 and 2017.
But now we run into a problem: there are certain ports which need to be open for Polybase to work. This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs. This goes back to the issue we see with spinning up data nodes in Docker: ports are not available. If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.
As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.
Supporting this configuration allows PolyBase to connect and query Hadoop clusters that have wire encryption turned on. This enables a secure connection between Hadoop and SQL Server; as well as, among the Hadoop Data Nodes.
To connect to a Hadoop cluster with the hadoop.rpc.protection set to privacy or integrity, you will need to alter the core-site.xml file that is installed with PolyBase. This file is generally found at C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf.
That’s good news for Polybase users.
Simon Whiteley continues his Polybase on Azure SQL Data Warehouse series. First, he covers data loading patterns:
That’s enough about data loading for now, there’s another major use case for Polybase that we haven’t yet discussed. Many data processing solutions have a huge, unwieldy overnight batch job that performs aggregates, lookups, analytics and various other calculations.
However, it is often the case that this is not timely enough for many business requirements. This is where Polybase can help.
If we have an External Table over the newest files, this will read these new records at query time. We can write a view that combines the External Table with our batch-produced table. This will obviously go a little slower than usual, given it has to read data from flat files each time, however the results returned will be up to date.
In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.
General best practice, therefore, would be write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we’re interested in here is the core data movement itself.
Writing a traditional INSERT statement isn’t the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, that is automatically configured for optimal bulk loading, this is the CTAS, or “Create Table as Select” statement.
This is a pair of interesting posts from Simon.
“Polybase is by far the fastest way to import/export data from SQLDW. If you’re loading any reasonably sized data volume, you should be using Polybase”
That’s not a quote – I just thought it would be more impactful looking like one!
For those of a traditional “Big Data” background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.
For everyone else – Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files – even those in sub-folders. For me, this is mind-bogglingly cool – you can query thousands of files at once by typing “select * from dbo.myexternaltable”.
This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:
SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016
This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.
It’s an interesting look at Polybase, focusing on Azure SQL Data Warehouse.
SQL Data Warehouse supports many loading methods, including SSIS, BCP, the SQLBulkCopy API, and Azure Data Factory (ADF). These methods all share a common pattern for data ingestion. By comparison, the PolyBase technology uses a different approach that provides better performance.
PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable, query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database platform ecosystems (APS only).
If you’re curious about Azure SQL Data Warehouse, this is a must-read.