Category: Polybase

PolyBase in SQL Server 2022: Cosmos DB via MongoDB API

Published 2022-08-31 by Kevin Feasel

I have gotten back on the data virtualization wagon:

Back in the 2019 days, I noted a problem when CU2 of SQL Server 2019 came out. This is because the Cosmos DB collection I was using reported a wire version of 2 rather than the minimum version of 3. The official fix at that time was to create a new collection using the then-latest version of 3.6 but that didn’t work for me. My workaround was to use the old MongoDB drivers that shipped with SQL Server 2019 RTM.
Well, as of 2022, that solution won’t work anymore. The original MongoDB drivers don’t ship with SQL Server 2022, so we can’t use that workaround. I had a Cosmos DB account that was originally built on version 3.6. Even after upgrading to server version 4.2, it still reported wire version 2 when I connected to the endpoint that was relevant 3 years ago. Therein lies the solution to the problem.

It turns out there are two viable solutions now and I show both of them.

SQL Server 2022 and S3 Object Integration

Published 2022-08-15 by Kevin Feasel

Anthony Nocentino does some PolyBase work:

In this blog post, I’ve implemented two example environments for using SQL Server 2022’s S3 object integration: one for backup and restore to S3-compatible object storage and the other for data virtualization using PolyBase connectivity to S3-compatible object storage. This work aims to get you up and running as quickly as possible with these new features. I implemented this in Docker Compose since that handles all the implementation and configuration steps for you. The complete code for this is available on my GitHub repo…I’m walking you through the implementation here in this post.
In my post Setting up MinIO for SQL Server 2022 s3 Object Storage Integration we did this step by step at the command line. Docker Compose does all the hard work for you, so you can get up and running fast.
Let’s walk through what you’ll get in each environment.

Read the whole thing, especially the note at the bottom about minimum RAM requirements on the Docker container.
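
For a rough feel of what the data virtualization side involves, here is a minimal sketch in Python that assembles the T-SQL for an S3-compatible external data source. The credential name, data source name, endpoint, and keys are made-up placeholders, not values from Anthony’s post, and the statement shapes follow the documented SQL Server 2022 syntax as I understand it, so check them against the documentation before running anything.

```python
def quote_literal(value: str) -> str:
    """Wrap a value in single quotes, doubling embedded quotes for T-SQL."""
    return "'" + value.replace("'", "''") + "'"


def bracket(name: str) -> str:
    """Wrap an identifier in square brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def s3_credential_sql(credential_name: str, access_key: str, secret_key: str) -> str:
    """Build a database scoped credential for S3-compatible storage.

    Assumption: the identity is the fixed string 'S3 Access Key' and the
    secret has the form '<access key>:<secret key>'.
    """
    secret = quote_literal(f"{access_key}:{secret_key}")
    return (
        f"CREATE DATABASE SCOPED CREDENTIAL {bracket(credential_name)}\n"
        f"WITH IDENTITY = 'S3 Access Key',\n"
        f"     SECRET = {secret};"
    )


def s3_data_source_sql(source_name: str, credential_name: str, host: str, port: int) -> str:
    """Build an external data source pointing at an S3-compatible endpoint."""
    location = quote_literal(f"s3://{host}:{port}/")
    return (
        f"CREATE EXTERNAL DATA SOURCE {bracket(source_name)}\n"
        f"WITH (LOCATION = {location},\n"
        f"      CREDENTIAL = {bracket(credential_name)});"
    )


if __name__ == "__main__":
    # Hypothetical names and endpoint, for illustration only.
    print(s3_credential_sql("s3_dc", "my-access-key", "my-secret-key"))
    print(s3_data_source_sql("s3_ds", "s3_dc", "minio.example", 9000))
```

The helpers only build strings; they are meant for generating setup scripts from trusted configuration, not for handling untrusted input.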

Error Calling SQLSetDescRec via PolyBase

Published 2022-04-01 by Kevin Feasel

Nathan Schoenack troubleshoots an error:

When trying to query an external table created for a generic ODBC external data source, the following error can be observed:
Message 7320, level 16, state 110, line 87
Unable to execute query “Remote Query” against OLE DB provider “MSOLEDBSQL” on link server “(null)”. 105082; Generic ODBC error: OdbcBufferReader.ReadBuffer, error in OdbcReadBuffer: SqlState: IM001, NativeError: 0, ‘Error calling: SQLSetDescRec(this->GetHdesc(), (SQLSMALLINT)column->idxServerCol, (SQLSMALLINT)column->odbcReadType, 0, column->valueLength, (SQLSMALLINT)column->precision, (SQLSMALLINT)column->scale, (SQLPOINTER)(pBuffer + column->valueOffset), (SQLLEN *)indPtr, (SQLLEN *)indPtr), SQL return code: -1 | SQL Error Info: Error <1>: ErrorMsg: [Microsoft][ODBC Driver Manager] The driver does not support this function. | Error calling: pReadConn->ReadBuffer(pBuffer, bufferOffset, bufferLength, pBytesRead, pRowsRead) | state: FFFF, number: 239, active connections: 9’, Connection String: Dsn={DSN Name};Driver={Driver Name};uid=root;server=xxxxx;port=xxxx;database=xxxx.

Read on for a viable workaround.
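
Messages like the one above bury the useful parts in a lot of noise. As a small illustration, the following Python sketch pulls out the ODBC SQLSTATE and the driver manager's message. The function name and the abbreviated sample string are my own assumptions for the example; the patterns are based only on the layout of the message quoted above and may not fit other PolyBase errors.

```python
import re

# Patterns based on the layout of the quoted message (an assumption,
# not a documented format).
SQLSTATE_PATTERN = re.compile(r"SqlState:\s*(?P<state>[0-9A-Z]{5})")
ERRORMSG_PATTERN = re.compile(r"ErrorMsg:\s*(?P<msg>[^|]+)")


def summarize_odbc_error(message: str) -> dict:
    """Extract the SQLSTATE and first ErrorMsg text from a PolyBase ODBC error."""
    state = SQLSTATE_PATTERN.search(message)
    detail = ERRORMSG_PATTERN.search(message)
    return {
        "sqlstate": state.group("state") if state else None,
        "detail": detail.group("msg").strip() if detail else None,
    }


if __name__ == "__main__":
    # Abbreviated copy of the message above; the long SQLSetDescRec call is omitted.
    sample = (
        "Generic ODBC error: OdbcBufferReader.ReadBuffer, error in OdbcReadBuffer: "
        "SqlState: IM001, NativeError: 0, "
        "ErrorMsg: [Microsoft][ODBC Driver Manager] The driver does not support "
        "this function. | Error calling: pReadConn->ReadBuffer"
    )
    print(summarize_odbc_error(sample))
```

SQLSTATE IM001 is the ODBC driver manager's code for "Driver does not support this function," which matches the message text: the driver lacks a function PolyBase tries to call.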

Data Virtualization with Azure SQL Managed Instance

Published 2022-03-09 by Kevin Feasel

Mladen Andzic announces data virtualization in Azure SQL Managed Instance:

Data virtualization capabilities, now in preview in Azure SQL Managed Instance, enable you to execute Transact-SQL (T-SQL) queries against data from files stored in Azure Data Lake Storage Gen2 or Azure Blob Storage and combine it with relational data stored locally in the managed instance using logical joins. This way you can transparently access external data while keeping it in its original format and location. There is no data duplication or need to run and maintain ETL processes, which means that you can extract and deliver insights faster. Currently supported file formats are Parquet, CSV, and JSON.

I’m going to start calling it PolyBase Duck Typing: it’s not actually PolyBase but the syntax is the same and the outcome is the same and the method to enable it is the same and “PolyBase” is a lot easier to say than “data virtualization.” So even though it’s not PolyBase, I’m going to call it PolyBase until there’s a meaningful split.

Thoughts on the Long Run: PolyBase

Published 2022-03-01 by Kevin Feasel

I have some thoughts on a recent announcement:

We could see the writing on the wall here ever since Cloudera and Hortonworks merged. Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP) were both on-premises offerings that you could also get in the cloud. Post-merger, Cloudera Data Platform (CDP) was cloud-only and, to my knowledge, they have never released an on-premises version. Cloud versus on-premises isn’t itself the issue but it does tie in with the issue: in order for PolyBase to work, certain ports need to be exposed on your Hadoop cluster. Cloud offerings tend not to want to expose a bunch of ports to internal services and so PolyBase to CDP was a non-starter.

It’s about 30% bad news, 50% good news, and 20% meh news. Click through for the longer-form version of that.

SQL Server Analytics Updates

Published 2022-02-28 by Kevin Feasel

The SQL Server team drops bad news on a Friday:

Today, we are announcing the retirement of PolyBase scale-out groups in Microsoft SQL Server. Scale-out group functionality will be removed from the product in SQL Server 2022. In-market SQL Server 2019, 2017, and 2016 will continue to support the functionality to the end of support for those products.

In addition to killing Big Data Clusters, they’re also killing the Java connector in PolyBase and scale-out groups. I have a blog post coming up today on the topic with my full set of thoughts. The short version is, “Mostly not bad, though losing scale-out groups sucks.”

Reading Flat Files from PolyBase

Published 2022-02-22 by Kevin Feasel

I try something out:

I really like PolyBase’s ability to read from Azure Blob Storage. But what about reading from files on disk? I knew that we could get Excel files working but how about regular ol’ text files? It turns out that you can, using the Microsoft Access Text Driver.

I won’t say it’s amazing but it does the trick.

Build a Sandbox for Testing PolyBase and Hadoop

Published 2022-01-11 by Kevin Feasel

Fernando Sibaja Araya has a step-by-step guide to building a Hadoop sandbox for testing PolyBase on SQL Server:

This guide will take you step by step through deploying a Hadoop sandbox into Azure. You will then connect to the sandbox through SSH and tunnel all the required ports to your machine so you can access all the endpoints needed to execute Hadoop queries from PolyBase.
We will be deploying Hortonworks Data Platform Sandbox 2.6.4. This will be one VM running in Azure, and within this VM a Docker container will have all the HDP services running.

Click through for the full set of instructions. I’m a little overjoyed that my blog snuck into the set of links and resources at the end.

Loading Azure Synapse Analytics using PolyBase

Published 2021-03-19 by Kevin Feasel

Gauri Mahajan needs to load some data:

Azure Synapse Analytics is Microsoft’s data warehousing offering on Azure. It supports three types of runtimes: SQL Serverless Pool, SQL Dedicated Pool, and Spark Pools. Because there are a variety of data sources on Azure, the data that has to be loaded into Azure Synapse pools varies in type and volume. There are three major approaches to loading data into Synapse. The COPY command is the most flexible and elaborate mechanism: you execute it from a SQL pool to load data from supported data repositories, and it is convenient for ad hoc and small to medium-sized loads. The second method is Bulk Insert, whose name is self-explanatory. The third is PolyBase: for ingesting data from supported repositories into dedicated SQL pools, it is as efficient as the COPY command and at times even more efficient. This article will help you understand the process of ingesting data into Azure Synapse Analytics using PolyBase.

Click through for the process.

Creating Parquet Files from SQL Server Data

Published 2021-03-16 by Kevin Feasel

Andy Leonard answers a challenge:

I searched and found some promising Parquet SSIS components available from CData Software and passed that information along. I shared my inexperience in exporting to parquet format and asked a few friends how they’d done it.
I thought: How many times have I demonstrated Azure Data Factory and clicked right past file format selection without giving Parquet a second thought? Too many times. It was time to change that.

Another route is to use PolyBase. If you’re okay with writing the results to Azure Blob Storage, you can insert the results of a SQL query directly into Parquet files. If that sounds interesting, here are posts on connecting to Azure Blob Storage via PolyBase and inserting into Azure Blob Storage. I insert in CSV format to make it easier for people to follow, but swap the file format with Parquet and it works all the same.
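
To make the "swap the file format" point concrete, here is a minimal sketch in Python that generates the external file format definition for either case. The function name and the format names are hypothetical, and the delimited-text options shown are only one plausible choice; the idea is that the external table and the INSERT statement stay the same while only the file format they reference changes.

```python
def external_file_format_sql(name: str, format_type: str = "DELIMITEDTEXT") -> str:
    """Build CREATE EXTERNAL FILE FORMAT for delimited text or Parquet.

    Assumption: a comma field terminator for delimited text and no
    explicit compression setting for Parquet.
    """
    format_type = format_type.upper()
    if format_type == "DELIMITEDTEXT":
        options = (
            "FORMAT_TYPE = DELIMITEDTEXT,\n"
            "      FORMAT_OPTIONS (FIELD_TERMINATOR = ',')"
        )
    elif format_type == "PARQUET":
        options = "FORMAT_TYPE = PARQUET"
    else:
        raise ValueError(f"Unsupported format type: {format_type}")
    # Double any closing bracket so the identifier stays well formed.
    safe_name = name.replace("]", "]]")
    return f"CREATE EXTERNAL FILE FORMAT [{safe_name}]\nWITH ({options});"


if __name__ == "__main__":
    # Hypothetical format names, for illustration only.
    print(external_file_format_sql("CsvFileFormat"))
    print(external_file_format_sql("ParquetFileFormat", "parquet"))
```

Writing through an external table also requires the instance-level 'allow polybase export' configuration option to be enabled.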
