A few months back, Microsoft started the Microsoft Professional Program for Data Science (note the program name change from Microsoft Professional Degree to Microsoft Professional Program, or MPP). This is online learning via edX.org, designed to teach the skills and provide the hands-on experience that a data science role requires. You may audit any of the courses, including the associated hands-on labs, for free. However, to receive credit towards completing the data science track in the Microsoft Professional Program, you must obtain a verified certificate (for a small fee) for each of the ten courses in the curriculum. The courses are presented in a suggested order to guide you as you build your skills, but you may take them in a different order if you prefer. You may also take them simultaneously or one at a time, so long as each course is completed within its specified session dates.
Look for it sometime next year.
The DATA_SOURCE and FILE_FORMAT options are easy: pick your external data source and external file format of choice.
The last major section deals with rejection. We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict checks of structure before inserting records. The Hadoop mindset is that there are two places in which you can perform data quality checks: in the original client (pushing data into HDFS) and in any clients reading data from HDFS. To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table. For example, let’s say that we have an Age column for each of our players, and that each age is an integer. If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail. Polybase rejects this row (removing it from the result set stream) and increments a rejection counter. What happens next depends upon the reject options.
Creating an external table is pretty easy once you have the foundation prepared.
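To make that concrete, here is a minimal sketch of what such a definition might look like. The table name, column list, file path, and the HDP and CsvFormat names are illustrative assumptions; only the Age column and the reject options tie back to the discussion above.

CREATE EXTERNAL TABLE dbo.Players
(
    FirstName NVARCHAR(50),
    LastName NVARCHAR(50),
    Age INT
)
WITH
(
    LOCATION = '/tmp/players/',      -- folder or file path in HDFS (assumed)
    DATA_SOURCE = HDP,               -- hypothetical external data source
    FILE_FORMAT = CsvFormat,         -- hypothetical external file format
    -- Tolerate up to 5 rejected rows (such as a header row whose "Age" value
    -- cannot convert to INT); beyond that, the query fails.
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);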
The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108). In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows. When I search for another second baseman in the set, I get back four rows, exactly as I would have expected.
What’s really interesting is the result I get back from Wireshark when I run a query without pushdown: it does actually return the row for Casey Turgeon.
This isn’t an ideal scenario, but it did seem to be consistent in my limited testing.
Delimited text is exactly as it sounds: you can use a comma, tab, pipe, tilde, or any other delimiter (including multi-character delimiters). So let’s go through the options here. First, FORMAT_TYPE must be DELIMITEDTEXT. From there, we have a few FORMAT_OPTIONS. I mentioned FIELD_TERMINATOR, which is how we separate the values in a record. We can also use STRING_DELIMITER if there are quotes or other markers around our string values.
DATE_FORMAT makes it easier for Polybase to understand how dates are formatted in your file. The MSDN document gives you hints on how to use specific date formats, but you can’t define a custom format today, or even use multiple date formats.
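As a sketch of how these options fit together, an external file format for a comma-delimited file might look like the following; the format name and the specific delimiter and date format choices are just examples.

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',      -- how values are separated in a record
        STRING_DELIMITER = '"',      -- marker around string values
        DATE_FORMAT = 'MM/dd/yyyy'   -- one of the predefined date formats
    )
);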
It feels like there’s a new Hadoop file format every day.
The majority of Spark is written in Scala (~80% of Spark core), which is a functional programming language. Functional programming languages emphasize functional purity (the output only depends on the inputs) and strive to avoid side effects. One important feature of many functional programming languages is lazy evaluation. While it might seem odd that we would appreciate laziness from our computing tools, lazy evaluation defers work until results are actually needed, which lets the engine evaluate computations as efficiently as possible.
Lazy evaluation is what allows Spark SQL to optimize its queries so heavily. When a user submits a query to Spark SQL, Spark composes the components of the SQL query into a logical plan. The logical plan is basically a recipe Spark SQL creates in order to evaluate the desired query. Spark SQL then submits the logical plan to Catalyst, its query optimizer, which turns the plan into a physical plan of action that is executed inside the Spark computation engine (a series of coordinating JVMs).
Read on for more description and code.
There are a couple of things I want to point out here. First, the Type is HADOOP, one of the three types currently available: HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).
Second, the Location is my name node on port 8020. If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and navigate to HDFS and then Configs. In the Advanced tab, you can see the name node.
There are different options available for different sources, but this post is focused on Hadoop.
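Putting those pieces together, a Hadoop external data source might look something like the sketch below; the HDP name is arbitrary, and the host assumes the Hortonworks sandbox mentioned above.

CREATE EXTERNAL DATA SOURCE HDP
WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
    -- RESOURCE_MANAGER_LOCATION can also be specified here if you want predicate pushdown.
);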
The short answer is, I’d get errors like the following when I try to run a MapReduce job:
Log Type: stderr
Log Upload Time: Thu Oct 27 00:16:23 +0000 2016
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
This was a rather vexing issue for a long time for me.
WebHCat is a web-based REST API for HCatalog, a management layer for dealing with files in HDFS. If you’re looking for configuration settings for WebHCat, you’ll generally want to look for “templeton” in config files, as Templeton was the project name before WebHCat. In Ambari, you can go to the Hive configs and look at webhcat-site.xml for configuration settings. For WebHCat, the default port in HDInsight is 30111, which you should find in the templeton.port configuration setting.
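As an example of what to look for, the relevant entry in webhcat-site.xml should resemble the snippet below (30111 is the HDInsight default noted above; adjust for your own cluster).

<property>
  <!-- The port WebHCat (Templeton) listens on; 30111 is the HDInsight default. -->
  <name>templeton.port</name>
  <value>30111</value>
</property>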
I don’t like the fact that WebHDFS is blocked, but at least WebHCat is functional.
For an introduction to this interesting Hadoop project, check out this article. Apache Kylin, originally from eBay, is a distributed analytics engine that provides SQL and OLAP access to Hadoop datasets using Hive and HBase. It can also be called through SparkSQL, making for a very useful project. It lets you work with Power BI, Tableau, and Excel, with more tool support coming soon. You can build MOLAP cubes and support many users with fast queries over billions of rows. Apache Kylin provides JDBC and ODBC drivers.
There are a few interesting options here.
For all usernames and principals, we will use suffixes like Cluster14 so that the names scale across multiple clusters.
- Active Directory setup:
- Create a new Organizational Unit for Hadoop users in AD, say (OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM).
- Create an hdfs superuser: [email protected]
- Cloudera Manager requires an Account Manager user that has privileges to create other accounts in Active Directory. You can use the Active Directory Delegate Control wizard to grant this user permission to create other users by checking the option to “Create, delete and manage user accounts”. Create a user [email protected] in OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM as an Account Manager.
Install the OpenLDAP utilities (openldap-clients on RHEL/CentOS) on the Cloudera Manager server host. Install the Kerberos client (krb5-workstation on RHEL/CentOS) on all hosts in the cluster. This step requires an internet connection on the Hadoop servers; if there is no internet connection, you can download the RPMs and install them manually.
This is absolutely worth the read.