External Tables To Hadoop

I have a post looking at creating external tables in Polybase to hit a Hadoop folder:

The DATA_SOURCE and DATA_FORMAT options are easy:  pick you external data source and external file format of choice.

The last major section deals with rejection.  We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict checks of structure before inserting records.  The Hadoop mindset is that there are two places in which you can perform data quality checks:  in the original client (pushing data into HDFS) and in any clients reading data from HDFS.  To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table.  For example, let’s say that we have a Age column for each of our players, and that each age is an integer.  If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail.  Polybase rejects this row (removing it from the result set stream) and increments a rejection counter.  What happens next depends upon the reject options.

Creating an external table is pretty easy once you have the foundation prepared.

Related Posts

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark: There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and […]

Read More

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow: Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info. Databricks Runtime 5.3 and 5.3 ML and above support […]

Read More

Categories

November 2016
MTWTFSS
« Oct Dec »
 123456
78910111213
14151617181920
21222324252627
282930