The dm_exec_external_work DMV tells us which execution we care about; in this case, I ended up running the same query twice, but I decided to look at the first run of it. Then, I can get step information from dm_exec_distributed_request_steps. This shows that we created a table in tempdb called TEMP_ID_14 and streamed results into it. The engine also created some statistics (though I’m not quite sure where it got the 24 rows from), and then we perform a round-robin query. Each Polybase compute node queries its temp table and streams the data back to the head node. Even though our current setup only has one compute node, the operation is the same as if we had a dozen Polybase compute nodes.
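The lookup described above can be sketched with two queries; the execution ID below is a placeholder for whichever run you pick out of the first DMV:

```sql
-- Find the execution you care about (each external query shows up here).
SELECT *
FROM sys.dm_exec_external_work;

-- Then pull the step-by-step plan for that execution, including the
-- TEMP_ID_* table creation and the round-robin streaming steps.
-- 'QID1076' is a placeholder; substitute your own execution_id.
SELECT *
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1076'
ORDER BY step_index;
```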
Click through for Wireshark-related fun.
Notice how 3bd shows up for pretty much all of these services. This is not what you’d want to do in a real production environment, but because we want to use Docker and easily pass ports through, it’s the simplest way for me to set this up. If you knew beforehand which node would host which service, you could modify the run.sh script that we discussed earlier and open those specific ports.
After assigning masters, we next have to define which nodes are clients in which clusters.
Click through for a screenshot-laden walkthrough.
HCatalog, also called HCat, is an interesting Apache project. It has the rare distinction of being one of the few Apache projects that began as part of another project, became its own project, and then returned to its original project, Apache Hive.
HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages. It exposes data stored in the Hive metastore to languages other than HQL. Classically, this has meant Pig and MapReduce. When Spark burst onto the big data scene, it too gained access to HCat.
Given HDInsight’s predilection toward WebHCat over WebHDFS, this does seem like a good thing to learn.
There are a number of commands you may need to use when administering your cluster if you are one of its administrators. If you are running your own personal cluster or sandbox, these are also good to know and try. Do not try these in production unless you are the owner and fully understand the dire consequences of these actions, as these commands affect the entire Hadoop distributed file system. You can shut down data nodes, add quotas to directories for various users, and perform other administrative tasks.
Many of the commands in this list should be familiar if you know much about Linux or Unix administration, but there are some Hadoop-specific commands as well, like moveFromLocal and moveToLocal.
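A few illustrative commands in both categories; the paths and quota value are placeholders, and anything touching dfsadmin should be run against a test cluster, not production:

```shell
# Administrative commands (require HDFS superuser rights):
hdfs dfsadmin -report                          # summarize data node health and capacity
hdfs dfsadmin -setSpaceQuota 10g /user/kevin   # cap a directory's space usage at 10 GB
hdfs dfsadmin -safemode enter                  # put the name node into read-only safe mode

# File commands, mixing familiar Unix-style verbs with Hadoop-specific ones:
hdfs dfs -ls /tmp
hdfs dfs -moveFromLocal ./players.csv /tmp/players.csv   # upload, deleting the local copy
```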
Polybase was first made available in Analytics Platform System in March 2013, and then in SQL Server 2016. The announcement at the PASS Summit was that, in a preview early next year, PolyBase in SQL Server 2016 will support Teradata, Oracle, SQL Server, and MongoDB in addition to Hadoop and Azure Blob Storage, and PolyBase in Azure SQL Data Warehouse will support Azure Data Lake Store.
With SQL Server 2016, you can create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance (see PolyBase scale-out groups):
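Joining an instance to a scale-out group is done per compute node; a hedged sketch, where the head node name, port, and instance name are placeholders for your environment:

```sql
-- Run on each compute node to join it to the head node's scale-out group.
EXEC sp_polybase_join_group
    @head_node_address = N'HeadNodeServer',
    @dms_control_channel_port = 16450,
    @head_node_sql_server_instance_name = N'MSSQLSERVER';

-- To remove a compute node from the group later:
-- EXEC sp_polybase_leave_group;
```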
I’m excited for the future of Polybase and looking forward to vNext and vNext + 1 (for the stuff which they can’t possibly get done in time for vNext).
SCP.Net generates a zip file consisting of the topology DLLs and dependency jars.
It uses Java (if found in the PATH) or .NET to generate the zip file. Unfortunately, zip files generated with .NET are not compatible with Linux clusters.
If you’re interested in working with a Storm topology while writing .NET code, check this out.
A few months back, Microsoft started the Microsoft Professional Program for Data Science (note the program name change from Microsoft Professional Degree to Microsoft Professional Program, or MPP). This is online learning via edX.org as a way to learn the skills and get the hands-on experience that a data science role requires. You may audit any courses, including the associated hands-on labs, for free. However, to receive credit towards completing the data science track in the Microsoft Professional Program, you must obtain a verified certificate for a small fee for each of the ten courses you successfully complete in the curriculum. The course schedule is presented in a suggested order, to guide you as you build your skills, but this order is only a suggestion. If you prefer, you may take them in a different order. You may also take them simultaneously or one at a time, so long as each course is completed within its specified session dates.
Look for it sometime next year.
The DATA_SOURCE and FILE_FORMAT options are easy: pick your external data source and external file format of choice.
The last major section deals with rejection. We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict structural checks before inserting records. The Hadoop mindset is that there are two places in which you can perform data quality checks: in the original client (pushing data into HDFS) and in any clients reading data from HDFS. To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table. For example, let’s say that we have an Age column for each of our players, and that each age is an integer. If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail. Polybase rejects this row (removing it from the result set stream) and increments a rejection counter. What happens next depends upon the reject options.
Creating an external table is pretty easy once you have the foundation prepared.
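Pulling the pieces together, here is a minimal sketch of an external table; the table, data source, and file format names are placeholders, and the column list is an assumption based on the baseball example above:

```sql
CREATE EXTERNAL TABLE dbo.SecondBasemen
(
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50),
    Age       INT
)
WITH
(
    LOCATION = '/tmp/ootp/secondbasemen.csv',  -- path within the external data source
    DATA_SOURCE = HDP,                         -- an external data source you created earlier
    FILE_FORMAT = CsvFileFormat,               -- an external file format you created earlier
    -- Reject options: fail the query once more than 5% of rows
    -- (checked in batches of 1000) fail conversion.
    REJECT_TYPE = PERCENTAGE,
    REJECT_VALUE = 5,
    REJECT_SAMPLE_VALUE = 1000
);
```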
The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108). In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows. When I search for another second baseman in the set, I get back four rows, exactly as I would have expected.
What’s really interesting is the result I get back from Wireshark when I run a query without pushdown: it does actually return the row for Casey Turgeon.
This isn’t an ideal scenario, but it did seem to be consistent in my limited testing.
Delimited text is exactly as it sounds: you can use a comma, tab, pipe, tilde, or any other delimiter (including multi-character delimiters). So let’s go through the options here. First, FORMAT_TYPE must be DELIMITEDTEXT. From there, we have a few FORMAT_OPTIONS. I mentioned FIELD_TERMINATOR, which is how we separate the values in a record. We can also use STRING_DELIMITER if there are quotes or other markers around our string values.
DATE_FORMAT makes it easier for Polybase to understand how dates are formatted in your file. The MSDN document gives you hints on how to use specific date formats, but you can’t define a custom format today, or even use multiple date formats.
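Putting those options together, a sketch of a delimited text file format; the format name and option values are illustrative, and DATE_FORMAT must be one of the supported format strings rather than an arbitrary custom pattern:

```sql
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',      -- could also be '\t', '|', '~', or multi-character
        STRING_DELIMITER = '"',      -- marker wrapped around string values
        DATE_FORMAT = 'yyyy-MM-dd',  -- one format for the whole file
        USE_TYPE_DEFAULT = TRUE      -- replace missing values with type defaults
    )
);
```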
It feels like there’s a new Hadoop file format every day.