External Tables To Hadoop

I have a post looking at creating external tables in Polybase to hit a Hadoop folder:

The DATA_SOURCE and DATA_FORMAT options are easy:  pick you external data source and external file format of choice.

The last major section deals with rejection.  We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict checks of structure before inserting records.  The Hadoop mindset is that there are two places in which you can perform data quality checks:  in the original client (pushing data into HDFS) and in any clients reading data from HDFS.  To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table.  For example, let’s say that we have a Age column for each of our players, and that each age is an integer.  If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail.  Polybase rejects this row (removing it from the result set stream) and increments a rejection counter.  What happens next depends upon the reject options.

Creating an external table is pretty easy once you have the foundation prepared.

Related Posts

Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka: When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message A before crashing […]

Read More

Avro Schemas In Kafka

Stephane Maarek explains the value of using Apache Avro as a schema structure for your Kafka topics: Avro has support for primitive types ( int, string, long, bytes, etc…), complex types (enum, arrays, unions, optionals), logical types (dates, timestamp-millis, decimal), and data record (name and namespace). All the types you’ll ever need. Avro has support for embedded documentation. Although documentation is optional, in my workflow I […]

Read More


November 2016
« Oct Dec »