Press "Enter" to skip to content

Category: Polybase

SQL Server 2022 and Azure Data Lake Storage

I continue a series on data virtualization in SQL Server 2022:

Today’s post is a fairly short one. Let’s say you want to connect to Azure Data Lake Storage Gen2 from SQL Server 2022. We saw in a recent post how to connect to Azure Blob Storage. When it comes to Azure Data Lake Storage, the story is almost the same, though there are a couple pitfalls you will want to avoid.

Read on for that error, which stymied me for a good 10 minutes.

Comments closed

Inserting into Azure Blob Storage from SQL Server 2022

I continue a series on data virtualization in SQL Server 2022:

Several years ago, I wrote a blog post on how to insert data into Azure Blob Storage from SQL Server using PolyBase. That technique used PolyBase V1: the Java connector for Hadoop. With SQL Server 2022 eliminating that connector, we’re going to learn the new method.

This is one of the larger practical differences in data virtualization with SQL Server 2022.

Comments closed

PolyBase and Named Instances

I show how to connect to a named instance using PolyBase in SQL Server 2019 or 2022:

We have two SQL Server instances running on the same machine. Before we get started, I do want to point out one thing: PolyBase can only work on one instance for a given server (physical machine or virtual machine) because the PolyBase engine and data movement services are system-level services. This means you cannot have PolyBase installed on your main instance as well as your named instance.

Click through for two methods.

Comments closed

Connecting to Azure Blob Storage with SQL Server 2022

I take a look back at the past and forward to the future:

PolyBase used to connect to Azure Blob Storage. Specifically, you could use the wasbs:// protocol and connect to Azure Blob Storage over WebHDFS. Here’s an example of an external data source which would work for SQL Server 2016 through 2019:

SQL Server 2022 changes its mechanisms around Azure Blob Storage a little bit, though I think the changes are sensible.

Comments closed

PolyBase and Windows Authentication in SQL Server 2022

I have some bad news:

Back in the SQL Server 2019 days, I laid out a post which covered a really annoying bug around Windows authentication and PolyBase. The upshot is that SQL Server 2019 CU2 broke Windows authentication, meaning that you could not create external objects on a SQL Server instance while using a Windows authenticated session. In CU2, if you tried, you would get a Level 20 error which would kill your session.

Over the next several CUs, the SQL Server team improved things incrementally: first, the error that you got was a Level 16 error, so your statement would fail but the session would remain active. Then, in CU6, they made it so that if your SQL Server instance is on a server running on a domain, then you would be okay—you could create external objects while connected with a domain-authenticated Windows account.

I decided to give this a try in SQL Server 2022 RC0 and ran into some interesting results. 

Click through for those results.

Comments closed

PolyBase and Cosmos DB’s Core API

I have some fun integrating the Cosmos DB Core API with PolyBase:

PolyBase comes with a few built-in drivers, including Oracle, Teradata, MongoDB, and SQL Server. For everything else in the 2019 “style” of things, there is a generic ODBC route. In this route, you need to obtain a valid ODBC driver, configure it, and let PolyBase know how to access data from that remote source.

Cosmos DB’s Core API just happens to have a working ODBC driver, so the first step is to grab the relevant version of that driver and install it on the machine running SQL Server.

Read on to see how it works and how you can get around some initial pain points. As a quick note, this only works with SQL Server on Windows, as SQL Server on Linux does not support generic ODBC drivers with PolyBase.

Comments closed

PolyBase in SQL Server 2022: Cosmos DB via MongoDB API

I have gotten back on the data virtualization wagon:

Back in the 2019 days, I noted a problem when CU2 of SQL Server 2019 came out. This is because the Cosmos DB collection I was using reported a wire version of 2 rather than the minimum version of 3. The official fix at that time was to create a new collection using the then-latest version of 3.6 but that didn’t work for me. My workaround was to use the old MongoDB drivers that shipped with SQL Server 2019 RTM.

Well, as of 2022, that solution won’t work anymore. The original MongoDB drivers don’t ship with SQL Server 2022, so we can’t use that workaround. I had a Cosmos DB account that was originally built on version 3.6. Even after upgrading to server version 4.2, it still reported wire version 2 when I connected to the endpoint that was relevant 3 years ago. Therein lies the solution to the problem.

It turns out there are two viable solutions now and I show both of them.

Comments closed

SQL Server 2022 and S3 Object Integration

Anthony Nocentino does some PolyBase work:

In this blog post, I’ve implemented two example environments for using SQL Server 2022’s s3 object integration. One for backup and restore to s3 compatible object storage and the other for data virtualization using Polybase connectivity to s3 compatible object storage. This work aims to get you up and running as quickly as possible to work with these new features. I implemented this in Docker Compose since that handles all the implementation and configuration steps for you. The complete code for this is available on my GitHub repo…I’m walking you through the implementation here in this post.

In my post Setting up MinIO for SQL Server 2022 s3 Object Storage Integration we did this step by step at the command line. Using Docker Compose it will do all the hard work for you and you can get up and running fast.

Let’s walk through what you’ll get in each environment.

Read the whole thing, especially the note at the bottom about minimum RAM requirements on the Docker container.

Comments closed

Error Calling SQLSetDescRec via PolyBase

Nathan Schoenack troubleshoots an error:

When trying to query an external table created for a generic ODBC external data source, the following error can be observed:

Message 7320, level 16, state 110, line 87

Unable to execute query “Remote Query” against OLE DB provider “MSOLEDBSQL” on link server “(null)”. 105082; Generic ODBC error: OdbcBufferReader.ReadBuffer, error in OdbcReadBuffer: SqlState: IM001, NativeError: 0, ‘Error calling: SQLSetDescRec(this->GetHdesc(), (SQLSMALLINT)column->idxServerCol, (SQLSMALLINT)column->odbcReadType, 0, column->valueLength, (SQLSMALLINT)column->precision, (SQLSMALLINT)column->scale, (SQLPOINTER)(pBuffer + column->valueOffset), (SQLLEN *)indPtr, (SQLLEN *)indPtr), SQL return code: -1 | SQL Error Info: Error <1>: ErrorMsg: [Microsoft][ODBC Driver Manager] The driver does not support this function. | Error calling: pReadConn->ReadBuffer(pBuffer, bufferOffset, bufferLength, pBytesRead, pRowsRead) | state: FFFF, number: 239, active connections: 9’, Connection String: Dsn={DSN Name};Driver={Driver Name};uid=root;server=xxxxx;port=xxxx;database=xxxx.

Read on for a viable workaround.

Comments closed

Data Virtualization with Azure SQL Managed Instance

Mladen Andzic announces data virtualization in Azure SQL Managed Instance:

Data virtualization capabilities, now in preview in Azure SQL Managed Instance, enable you to execute Transact-SQL (T-SQL) queries against data from files stored in Azure Data Lake Storage Gen2 or Azure Blob Storage and combine it with relational data stored locally in the managed instance using logical joins. This way you can transparently access external data while keeping it in its original format and location. There is no data duplication or need to run and maintain ETL processes, which means that you can extract and deliver insights faster. Currently supported file formats are Parquet, CSV, and JSON.

I’m going to start calling it PolyBase Duck Typing: it’s not actually PolyBase but the syntax is the same and the outcome is the same and the method to enable it is the same and “PolyBase” is a lot easier to say than “data virtualization.” So even though it’s not PolyBase, I’m going to call it PolyBase until there’s a meaningful split.

Comments closed