PolyBase and External Column Names

I have another post looking at external columns on PolyBase V2 data sources:

I’m going to use external two tables in this experiment. In the left corner, we have some ORC files stored in Azure Blob Storage which we’ll represent as FireIncidents2017. In the right corner, we have data stored in a remote SQL Server instance which we’ll call LineItem. The data doesn’t really matter that much, but to give you an idea of where we’re going, I’ll show each table. 

There’s quite a bit you can do here.

PolyBase and Azul Zulu OpenJDK

I have a post looking at one of the more interesting changes in SQL Server 2019 CTP 3.2:

One of the more interesting parts of SQL Server 2019 CTP 3.2’s release notes is the relationship between Microsoft and Azul Systems. Travis Wright covers it in some detail, as well as what it means for customers.

Prior to SQL Server 2019 CTP 3.2, installing PolyBase required an installation of Oracle’s Java Runtime Environment 7 Update 51 or higher, either directly from Oracle or through OpenJDK.

Java is still required if you want to read from or write to Hadoop or Azure Blob Storage. Oracle’s flavor of Java is no longer required, however.

Azul Java in SQL Server 2019

Travis Wright announces support for Azul Systems’ Java distribution in SQL Server 2019:

In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows for all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. That’s right – supported for free.

Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul System’s Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.

This is interesting. We’ll have to see if the CTP 3.2 installation doesn’t ask for JDK 1.8 anymore and just installs the Azul Systems version.

PolyBase in SQL Server 2019

Ben Weissman takes us through SQL Server 2019’s PolyBase enhancements:

Isn’t that the same thing, as a linked server?
At first sight, it sure looks like it. But there are a couple of differences. Linked Servers are instance scoped, whereas PolyBase is database scoped, which also means that PolyBase will automatically work across availability groups. Linked Servers use OLEDB providers, while PolyBase uses ODBC. There are a couple more, like the fact that PolyBase doesn’t support integrated security, but the most significant difference from a performance perspective is PolyBase’s capability to scale out – Linked Servers are single-threaded.

Read the whole thing. Ben asks and answers the question of whether PolyBase replaces ETL. You’ll want to read his answer. My answer (and I won’t tell you how close it is to his because I want you to read his article) is that PolyBase will only replace a fraction of total ETL and will act as an ETL process in a larger percentage of cases. I can see a pattern where you virtualize the data as external tables and then connect them together locally to insert into local facts and dimensions, for example. But there are too many things you can do with other ETL platforms which make me say this will never be a full replacement.

Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._
spark.sql("""
SELECT
INT(SUMLEV) AS SummaryLevel,
INT(COUNTY) AS CountyID,
INT(PLACE) AS PlaceID,
BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
NAME AS Name,
POPTYPE AS PopulationType,
INT(YEAR) AS Year,
INT(POPULATION) AS Population
FROM NorthCarolinaPopulation
WHERE
POPULATION <> 'A'
""")
.write.format("orc").saveAsTable("NorthCarolinaPopulationTyped")

It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.

PolyBase on Linux

I have a post showing how to set up PolyBase on Linux:

Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.

First, run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.

I actually fired up my copy of SimCity 2000 to take a screenshot for this post. The things I do for my audience.

PolyBase — SQL to SQL

I have a post covering PolyBase from SQL Server to SQL Server:

Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.

With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.

I want this to get even better, to the point where external tables are a no-brainer over linked servers in terms of performance.

DW Databases in PolyBase

I look at some databases people tend to ignore:

Today is a fairly short post covering a trio of databases you might not even know you have: DWConfiguration, DWDiagnostics, and DWQueue. The PolyBase installer drops all three of these on your instance. Let’s go in ascending order of the number of useful tables.

There are very few useful (to us) tables when using on-prem SQL Server as opposed to APS, but there are a few of note.

MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class:

Let me tell you about one of my least favorite things I like to see in PolyBase:

Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

This error is not limited to PolyBase but is instead an issue when trying to run MapReduce jobs in Hadoop. There are several potential causes, so let’s cover each of them as they relate to PolyBase and hopefully one of these solves your issue.

Click through for four potential solutions to what ails you.

PolyBase and Hive Shim Errors

I ran into a problem with Hive 3 and PolyBase:

My initial plan was to google things. The specific error: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That pops up HIVE-15326 and HIVE-15016 but gave me no immediate joy.

After reaching out to James Rowland-Jones (t), we (by which I mean he) eventually figured out the issue.

Click through for the solution.

Categories

August 2019
MTWTFSS
« Jul  
 1234
567891011
12131415161718
19202122232425
262728293031