I’m going to use two external tables in this experiment. In the left corner, we have some ORC files stored in Azure Blob Storage which we’ll represent as FireIncidents2017. In the right corner, we have data stored in a remote SQL Server instance which we’ll call LineItem. The data doesn’t really matter that much, but to give you an idea of where we’re going, I’ll show each table.
There’s quite a bit you can do here.
One of the more interesting parts of SQL Server 2019 CTP 3.2’s release notes is the relationship between Microsoft and Azul Systems. Travis Wright covers it in some detail, as well as what it means for customers.
Prior to SQL Server 2019 CTP 3.2, installing PolyBase required an installation of Oracle’s Java Runtime Environment 7 Update 51 or higher, either directly from Oracle or through OpenJDK.
Java is still required if you want to read from or write to Hadoop or Azure Blob Storage. Oracle’s flavor of Java is no longer required, however.
In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows for all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. That’s right – supported for free.
Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul System’s Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.
This is interesting. We’ll have to see whether the CTP 3.2 installation still asks for JDK 1.8 or just installs the Azul Systems version.
Isn’t that the same thing as a linked server?
At first sight, it sure looks like it. But there are a couple of differences. Linked Servers are instance scoped, whereas PolyBase is database scoped, which also means that PolyBase will automatically work across availability groups. Linked Servers use OLEDB providers, while PolyBase uses ODBC. There are a couple more, like the fact that PolyBase doesn’t support integrated security, but the most significant difference from a performance perspective is PolyBase’s capability to scale out – Linked Servers are single-threaded.
Read the whole thing. Ben asks and answers the question of whether PolyBase replaces ETL. You’ll want to read his answer. My answer (and I won’t tell you how close it is to his because I want you to read his article) is that PolyBase will only replace a fraction of total ETL and will act as an ETL process in a larger percentage of cases. I can see a pattern where you virtualize the data as external tables and then connect them together locally to insert into local facts and dimensions, for example. But there are too many things you can do with other ETL platforms which make me say this will never be a full replacement.
If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:
INT(SUMLEV) AS SummaryLevel,
INT(COUNTY) AS CountyID,
INT(PLACE) AS PlaceID,
BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
NAME AS Name,
POPTYPE AS PopulationType,
INT(YEAR) AS Year,
INT(POPULATION) AS Population
POPULATION <> 'A'
It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.
Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.
Run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.
I actually fired up my copy of SimCity 2000 to take a screenshot for this post. The things I do for my audience.
Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.
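To make the three entities concrete, here is a minimal T-SQL sketch for ORC files in Azure Blob Storage. Every object name, the storage account, the container, the path, and the column list are hypothetical placeholders rather than anything from the original post, and the sketch assumes the database already has a master key.

```sql
-- All names below are hypothetical; substitute your own account, container, and key.
-- Assumes a database master key already exists (CREATE MASTER KEY ...).
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- 1. External data source: where the remote data is stored.
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- 2. External file format: how the files are encoded (ORC here).
CREATE EXTERNAL FILE FORMAT OrcFileFormat
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- 3. External table: the column structure laid over those files.
CREATE EXTERNAL TABLE dbo.FireIncidents
(
    IncidentNumber INT NOT NULL,
    IncidentType NVARCHAR(100) NULL
)
WITH (
    LOCATION = '/fire-incidents/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = OrcFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
```

Once all three exist, the external table can be queried with an ordinary SELECT, just like a local table.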
With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.
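As a rough sketch of the two-object pattern against a remote SQL Server instance, something like the following should work on SQL Server 2019. The host name, login, remote database, and columns are all assumptions made up for illustration, and a database master key is assumed to exist already.

```sql
-- Hypothetical names throughout; a database master key must already exist.
-- PolyBase needs a SQL authentication login on the remote side.
CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCredential
WITH IDENTITY = 'remote_login', SECRET = '<password>';

-- External data source: points at the remote instance. No TYPE and no file format.
CREATE EXTERNAL DATA SOURCE RemoteSqlServer
WITH (
    LOCATION = 'sqlserver://remotehost:1433',
    CREDENTIAL = RemoteSqlCredential
);

-- External table: columns must line up with the remote table's definition.
CREATE EXTERNAL TABLE dbo.LineItem
(
    LineItemID INT NOT NULL,
    Quantity INT NOT NULL
)
WITH (
    LOCATION = 'RemoteDb.dbo.LineItem',  -- three-part name on the remote instance
    DATA_SOURCE = RemoteSqlServer
);
```

Note that there is no FILE_FORMAT clause anywhere: the remote source already returns structured rows.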
I want this to get even better, to the point where external tables are a no-brainer over linked servers in terms of performance.
Today is a fairly short post covering a trio of databases you might not even know you have: DWConfiguration, DWDiagnostics, and DWQueue. The PolyBase installer drops all three of these on your instance. Let’s go in ascending order of the number of useful tables.
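If you want to check whether these databases exist on your own instance, a quick query against the standard catalog view is enough. This is only a sketch for orientation, not something from the original post.

```sql
-- Returns 1 when the PolyBase feature is installed on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- List whichever of the three PolyBase databases are present.
SELECT name, database_id, create_date
FROM sys.databases
WHERE name IN (N'DWConfiguration', N'DWDiagnostics', N'DWQueue')
ORDER BY name;
```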
There are very few useful (to us) tables when using on-prem SQL Server as opposed to APS, but there are a few of note.
Let me tell you about one of my least favorite things to see in PolyBase:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
This error is not limited to PolyBase but is instead an issue when trying to run MapReduce jobs in Hadoop. There are several potential causes, so let’s cover each of them as they relate to PolyBase and hopefully one of these solves your issue.
Click through for four potential solutions to what ails you.
My initial plan was to google things. The specific error:
java.lang.IllegalArgumentException: Unrecognized Hadoop major version number.
That turns up HIVE-15326 and HIVE-15016, but neither gave me any immediate joy.
After reaching out to James Rowland-Jones, we (by which I mean he) eventually figured out the issue.
Click through for the solution.