Press "Enter" to skip to content

Category: Polybase

Data Virtualization in SQL Server 2022

Hugo Queiroz provides an overview of data virtualization options in SQL Server 2022:

SQL Server 2022 now supports CSV, Parquet, and Deltafiles stored on Azure Storage Account v2, Azure Data Lake Storage Gen2, or any simple storage service (S3)–compliant object storage—the last as an on-premises offering or in the cloud. Finally, SQL Server 2022 can now use Create External Table as Select (CETAS), together with commands like OPENROWSETCreate External Table (CET), and all the new T-SQL enhancements. SQL Server 2022 is a powerful data hub.

The post doesn’t get too deep into the topic, though a search here will find you links to articles with concrete examples.

Comments closed

Data Virtualization in Azure SQL Managed Instance

Mladen Andzic has an announcement:

We are excited to announce the general availability (GA) of data virtualization capabilities in Azure SQL Managed Instance, with improved query performance and managed identity as a new supported option for authenticating to storage accounts.  

The data virtualization enables you to execute Transact-SQL (T-SQL) queries on files storing data in common data formats in Azure Data Lake Storage Gen2 or Azure Blob Storage and combine it with relational data stored locally in the managed instance using logical joins. This way you can transparently access external data while keeping it in its original format and location. There is no data duplication or need to run and maintain ETL processes, which means that you can extract and deliver insights faster. The supported file formats are Parquet, CSV, and JSON.

This is similar to PolyBase in SQL Server 2019 but is a different underlying technology. In SQL Managed Instance, it looks like we only get API-based data virtualization, not the ODBC-based PolyBase we saw in SQL Server 2019.

Comments closed

SQL Server 2022 and Azure Data Lake Storage

I continue a series on data virtualization in SQL Server 2022:

Today’s post is a fairly short one. Let’s say you want to connect to Azure Data Lake Storage Gen2 from SQL Server 2022. We saw in a recent post how to connect to Azure Blob Storage. When it comes to Azure Data Lake Storage, the story is almost the same, though there are a couple pitfalls you will want to avoid.

Read on for that error, which stymied me for a good 10 minutes.

Comments closed

Inserting into Azure Blob Storage from SQL Server 2022

I continue a series on data virtualization in SQL Server 2022:

Several years ago, I wrote a blog post on how to insert data into Azure Blob Storage from SQL Server using PolyBase. That technique used PolyBase V1: the Java connector for Hadoop. With SQL Server 2022 eliminating that connector, we’re going to learn the new method.

This is one of the larger practical differences in data virtualization with SQL Server 2022.

Comments closed

PolyBase and Named Instances

I show how to connect to a named instance using PolyBase in SQL Server 2019 or 2022:

We have two SQL Server instances running on the same machine. Before we get started, I do want to point out one thing: PolyBase can only work on one instance for a given server (physical machine or virtual machine) because the PolyBase engine and data movement services are system-level services. This means you cannot have PolyBase installed on your main instance as well as your named instance.

Click through for two methods.

Comments closed

Connecting to Azure Blob Storage with SQL Server 2022

I take a look back at the past and forward to the future:

PolyBase used to connect to Azure Blob Storage. Specifically, you could use the wasbs:// protocol and connect to Azure Blob Storage over WebHDFS. Here’s an example of an external data source which would work for SQL Server 2016 through 2019:

SQL Server 2022 changes its mechanisms around Azure Blob Storage a little bit, though I think the changes are sensible.

Comments closed

PolyBase and Windows Authentication in SQL Server 2022

I have some bad news:

Back in the SQL Server 2019 days, I laid out a post which covered a really annoying bug around Windows authentication and PolyBase. The upshot is that SQL Server 2019 CU2 broke Windows authentication, meaning that you could not create external objects on a SQL Server instance while using a Windows authenticated session. In CU2, if you tried, you would get a Level 20 error which would kill your session.

Over the next several CUs, the SQL Server team improved things incrementally: first, the error that you got was a Level 16 error, so your statement would fail but the session would remain active. Then, in CU6, they made it so that if your SQL Server instance is on a server running on a domain, then you would be okay—you could create external objects while connected with a domain-authenticated Windows account.

I decided to give this a try in SQL Server 2022 RC0 and ran into some interesting results. 

Click through for those results.

Comments closed

PolyBase and Cosmos DB’s Core API

I have some fun integrating the Cosmos DB Core API with PolyBase:

PolyBase comes with a few built-in drivers, including Oracle, Teradata, MongoDB, and SQL Server. For everything else in the 2019 “style” of things, there is a generic ODBC route. In this route, you need to obtain a valid ODBC driver, configure it, and let PolyBase know how to access data from that remote source.

Cosmos DB’s Core API just happens to have a working ODBC driver, so the first step is to grab the relevant version of that driver and install it on the machine running SQL Server.

Read on to see how it works and how you can get around some initial pain points. As a quick note, this only works with SQL Server on Windows, as SQL Server on Linux does not support generic ODBC drivers with PolyBase.

Comments closed

PolyBase in SQL Server 2022: Cosmos DB via MongoDB API

I have gotten back on the data virtualization wagon:

Back in the 2019 days, I noted a problem when CU2 of SQL Server 2019 came out. This is because the Cosmos DB collection I was using reported a wire version of 2 rather than the minimum version of 3. The official fix at that time was to create a new collection using the then-latest version of 3.6 but that didn’t work for me. My workaround was to use the old MongoDB drivers that shipped with SQL Server 2019 RTM.

Well, as of 2022, that solution won’t work anymore. The original MongoDB drivers don’t ship with SQL Server 2022, so we can’t use that workaround. I had a Cosmos DB account that was originally built on version 3.6. Even after upgrading to server version 4.2, it still reported wire version 2 when I connected to the endpoint that was relevant 3 years ago. Therein lies the solution to the problem.

It turns out there are two viable solutions now and I show both of them.

Comments closed

SQL Server 2022 and S3 Object Integration

Anthony Nocentino does some PolyBase work:

In this blog post, I’ve implemented two example environments for using SQL Server 2022’s s3 object integration. One for backup and restore to s3 compatible object storage and the other for data virtualization using Polybase connectivity to s3 compatible object storage. This work aims to get you up and running as quickly as possible to work with these new features. I implemented this in Docker Compose since that handles all the implementation and configuration steps for you. The complete code for this is available on my GitHub repo…I’m walking you through the implementation here in this post.

In my post Setting up MinIO for SQL Server 2022 s3 Object Storage Integration we did this step by step at the command line. Using Docker Compose it will do all the hard work for you and you can get up and running fast.

Let’s walk through what you’ll get in each environment.

Read the whole thing, especially the note at the bottom about minimum RAM requirements on the Docker container.

Comments closed