Press "Enter" to skip to content

Category: Polybase

Thoughts on the Long Run: PolyBase

I have some thoughts on a recent announcement:

We could see the writing on the wall here ever since Cloudera and Hortonworks merged. Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP) were both on-premises offerings that you could also get in the cloud. Post-merger, Cloudera Data Platform (CDP) was cloud-only and, to my knowledge, they have never released an on-premises version. Cloud versus on-premises isn’t itself the issue but it does tie in with the issue: in order for PolyBase to work, certain ports need to be exposed on your Hadoop cluster. Cloud offerings tend not to want to expose a bunch of ports to internal services and so PolyBase to CDP was a non-starter.

It’s about 30% bad news, 50% good news, and 20% meh news. Click through for the longer-form version of that.

Comments closed

SQL Server Analytics Updates

The SQL Server team drops bad news on a Friday:

Today, we are announcing the retirement of PolyBase scale-out groups in Microsoft SQL Server. Scale-out group functionality will be removed from the product in SQL Server 2022. In-market SQL Server 2019, 2017, and 2016 will continue to support the functionality to the end of support for those products.

In addition to killing Big Data Clusters, they’re also killing the Java connector in PolyBase and scale-out groups. I have a blog post coming up today on the topic with my full set of thoughts. The short version is, “Mostly not bad, though losing scale-out groups sucks.”

Comments closed

Build a Sandbox for Testing PolyBase and Hadoop

Fernando Sibaja Araya has a step-by-step guide to building a Hadoop sandbox for testing PolyBase on SQL Server:

This guide will take you step by step into deploying a hadoop sandbox into Azure. You then will connect to the sandbox through SSH and tunnel all the required ports to your machine so you can access all the endpoints to execute hadoop queries from Polybase.

We will be deploying Hortonworks Data Platform Sandbox 2.6.4. This will be 1 VM running in azure and within this VM a docker container will have all the HDP services running.

Click through for the full set of instructions. I’m a little overjoyed that my blog snuck into the set of links and resources at the end.

Comments closed

Loading Azure Synapse Analytics using PolyBase

Gauri Mahajan needs to load some data:

Azure Synapse Analytics is Microsoft’s data warehousing offering on Azure Cloud. It supports three types of runtimes – SQL Serverless Pool, SQL Dedicated Pool, and Spark Pools. As there are a variety of data sources on Azure, it’s very obvious that there can be varying types and volumes of data that would have to be loaded into Azure Synapse pools. There are three major types of data ingestion approaches that can be used to load data into Synapse. The COPY command is the most flexible and elaborate mechanism, where someone can execute this command from a SQL pool to load data from supported data repositories. This command is convenient to load ad-hoc and small to medium-sized data loads into Synapse. The second method of loading data is the Bulk Insert, where the method name is self-relevant regarding the approach functionality. To ingest the data from supported repositories into dedicated SQL pools, PolyBase is as efficient and at times it’s even more efficient than the COPY command. This article will help you understand the process to ingest data into Azure Synapse Analytics using PolyBase to load the data.

Click through for the process.

Comments closed

Creating Parquet Files from SQL Server Data

Andy Leonard answers a challenge:

I searched and found some promising Parquet SSIS components available from CData Software and passed that information along. I shared my inexperience in exporting to parquet format and asked a few friends how they’d done it.

I thought: How many times have I demonstrated Azure Data Factory and clicked right past file format selection without giving Parquet a second thought? Too many times. It was time to change that.

Another route is to use PolyBase. If you’re okay with writing the results to Azure Blob Storage, you can insert directly into Parquet files the results of a SQL query. If that sounds interesting, here are posts on connecting to Azure Blob Storage via PolyBase and inserting into Azure Blob Storage. I insert in CSV format to make it easier for people to follow, but swap the file format with Parquet and it works all the same.

Comments closed

Connecting to Postgres with PolyBase

I clear one blog post off my backlog:

Now that we have some data, let’s go back to SQL Server. I assume you’ve already installed and configured PolyBase—if not, check out my presentation on PolyBase. Note that this requires SQL Server 2019 or later, as that’s the first version which supports PolyBase to ODBC. Here’s a script which assumes a database named Scratch and a master key <<SomeSecureKey>>.

Click through for step-by-step instructions to get started, though I will freely admit that I don’t have the Postgres knowledge to give you a full listing of sharp edges.

Comments closed

PolyBase and Availability Groups

Rajendra Gupta has a detailed article on working with PolyBase in an Availability Group:

In this 28th article for SQL Server Always On Availability Group series, we explore the high-availability for the SQL Server PolyBase(SSB) external tables using AG groups.

There’s a lot of detail in the article and it’s worth reading in conjunction with Nathan Schoenack’s post. Someday I’ll get to the blog post on my backlog around PolyBase and AGs, especially with scale-out clusters. Someday.

Comments closed

Querying an AS400 with PolyBase

Lee Markum proves you can read data from an AS400 via PolyBase:

Before I dive into that, why was I interested in this feature? What did I hope to gain? Well, first of all, there was definitely the motivation of wondering, “Can we get this to work?” Secondly, and more practically, the promise of SQL Server Data Virtualization is to make other data sources available without using a Linked Server and without the time it takes to develop an ETL process to move the data. On a related note, you can cut out the time it takes for an ETL job to actually move the data somewhere like a data warehouse or flattened tables for reporting. Third, the Polybase feature has a built in engine that can provide query performance not available via Linked Server. Fourth, I wanted to provide a way for developers to query data in the AS400 without having to learn the different syntax required by the AS400 iSeries. Fifth, query writers can also join the external able to local SQL Server data.

Lee used the ODBC driver functionality; click through to see how that worked and what needed to change.

Comments closed