Press "Enter" to skip to content

Category: PolyBase

Azul Java in SQL Server 2019

Travis Wright announces support for Azul Systems’ Java distribution in SQL Server 2019:

In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free, with support jointly provided by Microsoft and Azul. That’s right – supported for free.

Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul Systems’ Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.

This is interesting. We’ll have to see whether the CTP 3.2 installer stops asking for JDK 1.8 and simply installs the Azul Systems version instead.


PolyBase in SQL Server 2019

Ben Weissman takes us through SQL Server 2019’s PolyBase enhancements:

Isn’t that the same thing as a linked server?
At first sight, it sure looks like it. But there are a couple of differences. Linked Servers are instance-scoped, whereas PolyBase is database-scoped, which also means that PolyBase will automatically work across availability groups. Linked Servers use OLEDB providers, while PolyBase uses ODBC. There are a couple more differences, like the fact that PolyBase doesn’t support integrated security, but the most significant difference from a performance perspective is PolyBase’s capability to scale out – Linked Servers are single-threaded.

Read the whole thing. Ben asks and answers the question of whether PolyBase replaces ETL. You’ll want to read his answer. My answer (and I won’t tell you how close it is to his because I want you to read his article) is that PolyBase will replace only a fraction of total ETL, though it will act as an ETL process in a larger percentage of cases. I can see a pattern where you virtualize the data as external tables and then connect them together locally to insert into local facts and dimensions, for example. But there are too many things you can do with other ETL platforms that you can’t do with PolyBase, which is why I say this will never be a full replacement.
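To make that pattern concrete, here is a minimal T-SQL sketch. Every name here is hypothetical: assume dbo.RemoteCustomers and dbo.RemoteOrders are external tables pointing at some remote source, and the dimension and fact tables are ordinary local tables.

-- Load the dimension from a virtualized (external) table.
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName)
SELECT c.CustomerID, c.CustomerName
FROM dbo.RemoteCustomers c;

-- Join the external table to the freshly loaded dimension and load the fact table.
INSERT INTO dbo.FactOrders (CustomerKey, OrderDate, OrderAmount)
SELECT o.CustomerID, o.OrderDate, o.OrderAmount
FROM dbo.RemoteOrders o
    INNER JOIN dbo.DimCustomer dc
        ON dc.CustomerKey = o.CustomerID;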


Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._

// Cast the string columns to proper types, then save the result as an ORC-backed table.
spark.sql("""
    SELECT
        INT(SUMLEV) AS SummaryLevel,
        INT(COUNTY) AS CountyID,
        INT(PLACE) AS PlaceID,
        BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
        NAME AS Name,
        POPTYPE AS PopulationType,
        INT(YEAR) AS Year,
        INT(POPULATION) AS Population
    FROM NorthCarolinaPopulation
    WHERE
        POPULATION <> 'A'
""")
    .write.format("orc").saveAsTable("NorthCarolinaPopulationTyped")

It’s not all perfect, though: I also cover driver problems I ran into with Spark and Hive.
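For context, the SQL Server side of that connection is an ODBC-style external data source plus an external table over the Spark table above. A rough sketch follows; the driver name, host, port, and credential are all assumptions based on a typical Spark Thrift Server setup, so adjust to your environment.

-- Assumes a database master key and a Spark ODBC driver are already in place.
CREATE DATABASE SCOPED CREDENTIAL SparkCredential
WITH IDENTITY = 'SparkUser', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE SparkCluster WITH
(
    LOCATION = 'odbc://clusterino:10016',
    CONNECTION_OPTIONS = 'Driver={Microsoft Spark ODBC Driver}; Host=clusterino; Port=10016',
    PUSHDOWN = ON,
    CREDENTIAL = SparkCredential
);

-- Column types mirror the casts in the Spark DataFrame above.
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationTyped
(
    SummaryLevel INT,
    CountyID INT,
    PlaceID INT,
    IsPrimaryGeography BIT,
    Name NVARCHAR(120),
    PopulationType NVARCHAR(30),
    Year INT,
    Population INT
)
WITH
(
    LOCATION = 'NorthCarolinaPopulationTyped',
    DATA_SOURCE = SparkCluster
);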


PolyBase on Linux

I have a post showing how to set up PolyBase on Linux:

Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.

First, run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.
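The command itself isn’t quoted above, but it is presumably something along these lines (all of the specifics here are assumptions, not a copy of the post):

# Open a root shell inside the SQL Server container; the ID comes from docker ps.
docker exec -it -u root 818623137e9f /bin/bash

# Inside the container: the SQL Server 2019 Linux images are Ubuntu-based, and
# assuming the Microsoft package repository is registered in the image, apt-get
# can pull down the PolyBase package.
apt-get update
apt-get install -y mssql-server-polybase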

I actually fired up my copy of SimCity 2000 to take a screenshot for this post. The things I do for my audience.


PolyBase — SQL to SQL

I have a post covering PolyBase from SQL Server to SQL Server:

Historically, PolyBase has used three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server the shape of that data—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of data in a particular external file format at a particular external data source.

With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.

I want this to get even better, to the point where external tables are a no-brainer over linked servers in terms of performance.
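For reference, the V2 shape looks something like the following sketch. Server, database, and table names are hypothetical, and the master key line assumes you haven’t already created one.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCredential
WITH IDENTITY = 'PolyBaseUser', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE RemoteSqlServer WITH
(
    LOCATION = 'sqlserver://remote-host:1433',
    CREDENTIAL = RemoteSqlCredential
);

-- Note what's missing: no external file format. The column list just needs to
-- match the remote table's structure.
CREATE EXTERNAL TABLE dbo.RemoteOrders
(
    OrderID INT NOT NULL,
    OrderDate DATE NOT NULL,
    OrderAmount DECIMAL(12, 2) NOT NULL
)
WITH
(
    LOCATION = 'SourceDb.dbo.Orders',
    DATA_SOURCE = RemoteSqlServer
);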


DW Databases in PolyBase

I look at some databases people tend to ignore:

Today is a fairly short post covering a trio of databases you might not even know you have: DWConfiguration, DWDiagnostics, and DWQueue. The PolyBase installer drops all three of these on your instance. Let’s go in ascending order of the number of useful tables.

There are very few useful (to us) tables when using on-prem SQL Server as opposed to APS, but there are a few of note.
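If you want to confirm the trio is on your instance, a quick query does it:

-- The PolyBase installer creates these three support databases.
SELECT name, create_date
FROM sys.databases
WHERE name IN (N'DWConfiguration', N'DWDiagnostics', N'DWQueue');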


MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class:

Let me tell you about one of my least favorite things to see in PolyBase:

Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

This error is not limited to PolyBase but is instead an issue when trying to run MapReduce jobs in Hadoop. There are several potential causes, so let’s cover each of them as they relate to PolyBase and hopefully one of these solves your issue.

Click through for four potential solutions to what ails you.
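One of the classic culprits for this error (in or out of PolyBase) is a missing or incorrect yarn.application.classpath. As an illustration rather than a definitive fix: the property lives in yarn-site.xml, and on the SQL Server side PolyBase keeps its own copies of the Hadoop configuration files (under the instance’s ...\MSSQL\Binn\PolyBase\Hadoop\conf directory), which need values matching your cluster. Exact paths vary by distribution, so treat this fragment as a template.

<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>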


PolyBase and Pushdown Limitations

I have a post covering something I learned about predicate pushdown against Hadoop in PolyBase:

Before I start, let’s talk about predicate pushdown for a moment. The gist of it is that when you have data in two sources, you have two options for combining the data:

1. Bring the data in its entirety from your remote source to your local target and work as though everything were in the local target to begin with. I’ll call this the streaming approach.

2. Send as much of your query’s filters, projections, and pre-conditions as possible to the remote source, have the remote source perform some of the work, and then have the remote source send its post-operative data to the local target. Then, the local target once more treats this as though it were simply local data. This is the pushdown approach because you push down those predicates (that is, filters, projections, and pre-conditions).

Click through for the unfortunate finding and also vote up my UserVoice feature request if you want to see string columns as filters.
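If you want to see the two behaviors side by side against a Hadoop-backed external table, you can steer the engine with query hints. The table and predicate below are hypothetical.

-- Insist that filtering happen on the remote side (spins up a MapReduce job).
SELECT COUNT(*)
FROM dbo.NorthCarolinaPopulation
WHERE CountyID = 37
OPTION (FORCE EXTERNALPUSHDOWN);

-- Stream the full data set over and filter locally instead.
SELECT COUNT(*)
FROM dbo.NorthCarolinaPopulation
WHERE CountyID = 37
OPTION (DISABLE EXTERNALPUSHDOWN);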


Querying CosmosDB with PolyBase

Hasan Savran shows how you can query CosmosDB with T-SQL statements:

SQL Server PolyBase services let us pull data from other data sources by using T-SQL queries. SQL Server 2019 introduces new connectors for PolyBase services, like Oracle, Teradata, and MongoDB. In one of the SQL Server 2019 presentations from Bob Ward, I saw the CosmosDB logo when he was talking about the new connectors in SQL Server 2019. CosmosDB already has an ODBC driver, and you can use it as a data source for Power BI, SSIS, or SSMS. SQL Server 2019 makes this connection easier by using the PolyBase services.

In this post, I am going to show you how to configure SQL Server 2019 PolyBase to connect to Azure CosmosDB Mongo databases.

I have a bit of a vested interest in this, so it is heartening to see people trying out PolyBase. It’s more heartening when they use it after they try it out.
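As a taste of what that configuration looks like, here is a sketch using the MongoDB connector. The account name, key, endpoint, and table shape are all assumptions; depending on your driver setup you may also need SSL-related connection options.

CREATE DATABASE SCOPED CREDENTIAL CosmosCredential
WITH IDENTITY = '<cosmos account name>', SECRET = '<cosmos account key>';

-- Cosmos DB's Mongo API listens on port 10255.
CREATE EXTERNAL DATA SOURCE CosmosDB WITH
(
    LOCATION = 'mongodb://<account>.documents.azure.com:10255',
    CREDENTIAL = CosmosCredential
);

-- Mongo ObjectIds surface as 24-character strings.
CREATE EXTERNAL TABLE dbo.VisitorLogs
(
    _id NVARCHAR(24),
    VisitorName NVARCHAR(100),
    VisitCount INT
)
WITH
(
    LOCATION = 'LogsDatabase.VisitorLogs',
    DATA_SOURCE = CosmosDB
);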
