If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:
INT(SUMLEV) AS SummaryLevel,
INT(COUNTY) AS CountyID,
INT(PLACE) AS PlaceID,
BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
NAME AS Name,
POPTYPE AS PopulationType,
INT(YEAR) AS Year,
INT(POPULATION) AS Population
POPULATION <> 'A'
It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.
Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.
docker psto figure out your container ID. Mine is
818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.
I actually fired up my copy of SimCity 2000 to take a screenshot for this post. The things I do for my audience.
Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.
With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.
I want this to get even better, to the point where external tables are a no-brainer over linked servers in terms of performance.
Today is a fairly short post covering a trio of databases you might not even know you have: DWConfiguration, DWDiagnostics, and DWQueue. The PolyBase installer drops all three of these on your instance. Let’s go in ascending order of the number of useful tables.
There are very few useful (to us) tables when using on-prem SQL Server as opposed to APS, but there are a few of note.
Let me tell you about one of my least favorite things I like to see in PolyBase:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
This error is not limited to PolyBase but is instead an issue when trying to run MapReduce jobs in Hadoop. There are several potential causes, so let’s cover each of them as they relate to PolyBase and hopefully one of these solves your issue.
Click through for four potential solutions to what ails you.
My initial plan was to google things. The specific error:
java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That pops up HIVE-15326 and HIVE-15016 but gave me no immediate joy.
After reaching out to James Rowland-Jones (t), we (by which I mean he) eventually figured out the issue.
Click through for the solution.
Before I start, let’s talk about predicate pushdown for a moment. The gist of it is that when you have data in two sources, you have two options for combining the data:
1. Bring the data in its entirety from your remote source to your local target and work as though everything were in the local target to begin with. I’ll call this the streaming approach.
2. Send as much of your query’s filters, projections, and pre-conditions to the remote source, have the remote source perform some of the work, and then have the remote source send its post-operative data to the local target. Then, the local target once more treats this as though it were simply local data. This is the pushdown approach because you push down those predicates (that is, filters, projections, and pre-conditions).
Click through for the unfortunate finding and also vote up my UserVoice feature request if you want to see string columns as filters.
SQL Server Polybase Services lets us to pull data from other data resources by using T-SQL queries. SQL Server 2019 introduces new connectors for Polybase services like Oracle, Teradata and MongoDB. In one of the SQL Server 2019 presentations from Bob Ward, I saw the CosmosDB logo when he was talking about the new connectors of SQL Server 2019. CosmosDB already has an ODBC driver and you can use it as a datasource for your Power BI, SSIS or SSMS. SQL Server 2019 makes this connection easier by using the Polybase services.
In this post, I am going to show you how to configure SQL Server 2019 Polybase to connect Azure CosmosDB Mongo databases
I have a bit of a vested interest in this, so it is heartening to see people trying out PolyBase. It’s more heartening when they use it after they try it out.
Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being
quickstart.clouderalike the host name, is actually
localhost. You can change this in
Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.
I’m used to the Hortonworks Data Platform, so this is a quick compendium of things I noticed to were different.
If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part
The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.