DW Databases in PolyBase

I look at some databases people tend to ignore:

Today is a fairly short post covering a trio of databases you might not even know you have: DWConfiguration, DWDiagnostics, and DWQueue. The PolyBase installer drops all three of these on your instance. Let’s go in ascending order of the number of useful tables.

There are very few useful (to us) tables when using on-prem SQL Server as opposed to APS, but there are a few of note.

MRAppMaster Errors Running MapReduce Jobs

I have a post looking at potential causes when PolyBase MapReduce jobs are unable to find the MRAppMaster class:

Let me tell you about one of my least favorite things I like to see in PolyBase:

Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

This error is not limited to PolyBase but is instead an issue when trying to run MapReduce jobs in Hadoop. There are several potential causes, so let’s cover each of them as they relate to PolyBase and hopefully one of these solves your issue.

Click through for four potential solutions to what ails you.

PolyBase and Hive Shim Errors

I ran into a problem with Hive 3 and PolyBase:

My initial plan was to google things. The specific error: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That pops up HIVE-15326 and HIVE-15016 but gave me no immediate joy.

After reaching out to James Rowland-Jones (t), we (by which I mean he) eventually figured out the issue.

Click through for the solution.

PolyBase and Pushdown Limitations

I have a post covering something I learned about predicate pushdown against Hadoop in PolyBase:

Before I start, let’s talk about predicate pushdown for a moment. The gist of it is that when you have data in two sources, you have two options for combining the data:

1. Bring the data in its entirety from your remote source to your local target and work as though everything were in the local target to begin with. I’ll call this the streaming approach.

2. Send as much of your query’s filters, projections, and pre-conditions to the remote source, have the remote source perform some of the work, and then have the remote source send its post-operative data to the local target. Then, the local target once more treats this as though it were simply local data. This is the pushdown approach because you push down those predicates (that is, filters, projections, and pre-conditions).

Click through for the unfortunate finding and also vote up my UserVoice feature request if you want to see string columns as filters.

Querying CosmosDB with PolyBase

Hasan Savran shows how you can query CosmosDB with T-SQL statements:

SQL Server Polybase Services lets us to pull data from other data resources by using T-SQL queries. SQL Server 2019 introduces new connectors for Polybase services like Oracle, Teradata and MongoDB. In one of the SQL Server 2019 presentations from Bob Ward, I saw the CosmosDB logo when he was talking about the new connectors of SQL Server 2019. CosmosDB already has an ODBC driver and you can use it as a datasource for your Power BI, SSIS or SSMS. SQL Server 2019 makes this connection easier by using the Polybase services.
      
      In this post, I am going to show you how to configure SQL Server 2019 Polybase to connect Azure CosmosDB Mongo databases

I have a bit of a vested interest in this, so it is heartening to see people trying out PolyBase. It’s more heartening when they use it after they try it out.

Tips For Using PolyBase With Cloudera QuickStart VM

I have a post on using Cloudera’s QuickStart VM with PolyBase:

Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini.

Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.

I’m used to the Hortonworks Data Platform, so this is a quick compendium of things I noticed to were different.

Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.

Keeping Polybase Tables In Sync With Biml

Ben Weissman combines two of my favorite things:

If you have started playing with polybase, you probably figured out by now, that – as awesome as it is – it’s still a bit of a pain to set it up and maintain external tables. There is a wizard in Azure Data Studio but it’s still under development, especially from a usability standpoint.
So what can be done about that? Well, we effectively looking for an easy way to read metadata from a relational database and automate T-SQL to mirror that metadata. HELLO?! Perfect usecase for Biml – which is NOT just for SSIS.
Let’s take a look at how that can be done…

If only Ben could have used F# instead of VB and VB with curly braces…

Technologies Surrounding Big Data Clusters In SQL Server 2019

Buck Woody has some long-term homework for people:

Some of these technologies and concepts are not owned or created by Microsoft – the concepts are universal, and a few of the technologies are open-source. I’ve marked those in italics.

I’ve also included a few links to a training resource I’ve found to be useful. I normally use LinkedIn Learning for larger courses, along with EdX, DataCamp, and many other platforms for in-depth training. The links I have indicated here are by no means exhaustive, but they are free, and provide a good starting point.

Click through for a list of some of the technologies in play.

Polybase And Azure Data Studio

Rajendra Gupta continues his series on Polybase in SQL Server 2019 with a look at Polybase integration in Azure Data Studio:

We have learned earlier that PolyBase in SQL Server 2019 Preview allows access to various data sources such as SQL Server, Oracle, MongoDB, Teradata, and ODBC based sources etc. Azure Data Studio SQL Server 2019 preview extension currently supports for SQL Server and Oracle data sources only from the External table wizard.

In this series, we will create an external table for SQL Server and explore some more features around it.

Launch Azure Data Studio and connect to the SQL Server 2019 preview instance. Right click on the database and launch ‘Create External Table’.

Rajendra also looks at some of the Polybase DMVs and the notion of predicate pushdown, which is critical to understand for writing Polybase queries which perform well.

Categories

April 2019
MTWTFSS
« Mar  
1234567
891011121314
15161718192021
22232425262728
2930