Once we did that and I restarted all of the services, I ended up getting an interesting error message from SQL Server:
Msg 7320, Level 16, State 110, Line 2
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to JobSubmitter_SubmitJob: Error [org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=512
The error message is pretty clear: the Polybase service wants to create containers that are 1536 MB in size, but the maximum size I’m allowing is 512 MB. Therefore, the Polybase MapReduce operation fails.
Long story short, I needed enough RAM to give YARN 4.5 GB for creating MapReduce containers in order to run my query.
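For reference, the settings involved live in yarn-site.xml; the values below are an illustrative sketch rather than my exact configuration, but they show the two knobs in play: the total memory YARN may allocate on a node and the maximum size of any single container.

    <!-- yarn-site.xml (illustrative values; size these to your cluster's available RAM) -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4608</value>   <!-- total RAM YARN may hand out on this node (~4.5 GB) -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>1536</value>   <!-- must be at least the container size Polybase requests -->
    </property>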
As a reminder, in order to allow predicate pushdown to occur, we need to meet a few requirements. First, we need to hit a Hadoop cluster; we can't use predicate pushdown on other systems like Azure Blob Storage. Second, we need to have a resource manager location set up in our external data source. Third, we need to make sure that everything is configured correctly on the Polybase side. But once you have those items in place, it's possible to use the FORCE EXTERNALPUSHDOWN command like so:
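Here is a minimal sketch of what that looks like; the external table and predicate are hypothetical stand-ins, but the OPTION clause is the real hint:

    -- Force the predicate to be evaluated on the Hadoop side as a MapReduce job
    SELECT COUNT(*)
    FROM dbo.SomeExternalTable   -- hypothetical external table
    WHERE EventYear = 2015
    OPTION (FORCE EXTERNALPUSHDOWN);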
There’s also discussion of preventing MapReduce job creation as well as a pushdown-related error I had received in the past.
Stream 2 sends along 27 MB worth of data. It’s packaging everything Polybase needs to perform operations, including JAR files and any internal conversion work that the Polybase engine might need to translate Hadoop results into SQL Server results. For example, the sqlsort at the bottom is a DLL that performs ordering using SQL Server-style collations, as per the Polybase academic paper.
Stream 2 is responsible for almost every packet from 43 through 19811; only 479 packets in that range belonged to some other stream (19811 – 43 – 18864 – 425 = 479). We send all of this data to the data node via port 50010.
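If you want to isolate that traffic in a capture of your own, two Wireshark display filters do the trick: the first keeps just the packets in TCP stream 2 (the stream number is specific to my capture), and the second keeps all traffic to or from the data node transfer port.

    tcp.stream eq 2
    tcp.port == 50010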
If you love looking at Wireshark streams, you’ll love this post.
The dm_exec_external_work DMV tells us which execution we care about; in this case, I ended up running the same query twice, but I decided to look at the first run of it. Then, I can get step information from dm_exec_distributed_request_steps. This shows that we created a table in tempdb called TEMP_ID_14 and streamed results into it. The engine also created some statistics (though I’m not quite sure where it got the 24 rows from), and then we perform a round-robin query. Each Polybase compute node queries its temp table and streams the data back to the head node. Even though our current setup only has one compute node, the operation is the same as if we had a dozen Polybase compute nodes.
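If you want to trace this yourself, a sketch of the pattern looks like the following: grab the execution ID of a recent Polybase query from dm_exec_external_work, then list the distributed steps the engine generated for it.

    -- Find a recent Polybase execution and walk its distributed plan steps.
    DECLARE @ExecutionID nvarchar(32);

    SELECT TOP(1) @ExecutionID = w.execution_id
    FROM sys.dm_exec_external_work w
    ORDER BY w.start_time DESC;

    SELECT s.step_index, s.operation_type, s.location_type, s.status, s.row_count, s.command
    FROM sys.dm_exec_distributed_request_steps s
    WHERE s.execution_id = @ExecutionID
    ORDER BY s.step_index;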
Click through for Wireshark-related fun.
Notice how 3bd shows up for pretty much all of these services. This is not what you'd want to do in a real production environment, but because we want to use Docker and easily pass ports through, it's the simplest way for me to set this up. If you knew beforehand which node would host which service, you could modify the run.sh shell script that we discussed earlier and open those specific ports.
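As a sketch of what that might look like (the image name and port list are placeholders, not the actual contents of run.sh), you would publish only the ports for the services you know a given node will host:

    # Publish only the ports this node actually needs
    # (8080 = Ambari UI, 50070 = HDFS NameNode UI, 8088 = YARN ResourceManager UI).
    # Image name and ports are placeholders, not the real run.sh contents.
    docker run -d --name hdp-node01 \
      -p 8080:8080 \
      -p 50070:50070 \
      -p 8088:8088 \
      some-hdp-image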
After assigning masters, we next have to define which nodes are clients in which clusters.
Click through for a screenshot-laden walkthrough.
HCatalog, also called HCat, is an interesting Apache project. It has the rare distinction of being one of the few Apache projects that started as part of another project, became its own project, and then returned to the original project, Apache Hive.
HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages: it exposes data stored in the Hive metastore to languages other than HQL. Classically, that has meant Pig and MapReduce; when Spark burst onto the big data scene, it too gained access to HCat.
Given HDInsight’s predilection toward WebHCat over WebHDFS, this does seem like a good thing to learn.
There are a number of commands that you may need to use for administering your cluster if you are one of its administrators. If you are running your own personal cluster or Sandbox, these are also good to know and try. Do Not Try These In Production unless you are the owner and fully understand the dire consequences of these actions: these commands affect the entire Hadoop cluster's distributed file system. You can shut down data nodes, add quotas to directories for various users, and use other administrative features.
Many of the commands in this list should be familiar if you know much about Linux or Unix administration, but there are some Hadoop-specific commands as well, like moveFromLocal and moveToLocal.
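A few representative commands, run from a node with the Hadoop client installed (the paths and quota sizes here are made up for illustration):

    # Report on the state of HDFS and its data nodes
    hdfs dfsadmin -report

    # Put a 10 GB space quota on a user's directory (illustrative path)
    hdfs dfsadmin -setSpaceQuota 10g /user/someuser

    # Move a local file into HDFS, removing the local copy afterward
    hdfs dfs -moveFromLocal ./results.csv /user/someuser/results.csv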
Polybase was first made available in the Analytics Platform System in March 2013, and then in SQL Server 2016. The announcement at the PASS Summit was that, in a preview early next year, PolyBase in the next version of SQL Server will support Teradata, Oracle, SQL Server, and MongoDB in addition to Hadoop and Azure Blob Storage. Azure Data Lake Store will also be supported by PolyBase in Azure SQL Data Warehouse.
With SQL Server 2016, you can create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance (see PolyBase scale-out groups):
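As a sketch of how a compute node joins an existing scale-out group (the machine and instance names below are placeholders), you run sp_polybase_join_group on the compute node and then restart its PolyBase Engine and PolyBase Data Movement services:

    -- Run on the compute node.  Parameters: the head node's machine name,
    -- the head node's PolyBase control channel port (16450 by default),
    -- and the head node's SQL Server instance name.
    EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';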
I’m excited for the future of Polybase and looking forward to vNext and vNext + 1 (for the stuff which they can’t possibly get done in time for vNext).
SCP.Net generates a zip file consisting of the topology DLLs and dependency jars.
It uses Java (if found in the PATH) or .NET to generate the zip. Unfortunately, zip files generated with .NET are not compatible with Linux clusters.
If you're interested in working with a Storm topology while writing .NET code, check this out.
A few months back, Microsoft started the Microsoft Professional Program for Data Science (note the program name change from Microsoft Professional Degree to Microsoft Professional Program, or MPP). This is online learning via edX.org as a way to learn the skills and get the hands-on experience that a data science role requires. You may audit any courses, including the associated hands-on labs, for free. However, to receive credit towards completing the data science track in the Microsoft Professional Program, you must obtain a verified certificate for a small fee for each of the ten courses you successfully complete in the curriculum. The course schedule is presented in a suggested order, to guide you as you build your skills, but this order is only a suggestion. If you prefer, you may take them in a different order. You may also take them simultaneously or one at a time, so long as each course is completed within its specified session dates.
Look for it sometime next year.