Split Query Processing In Polybase

David DeWitt, et al., describe the Polybase engine in an academic article:

When compiling a SQL query that references an external table stored in an HDFS file, the PDW Engine Service contacts the Hadoop Namenode for information about the file. This information, combined with the number of DMS instances in the PDW cluster, is used to calculate the portion (offset and length) of the input file(s) each DMS instance should read from HDFS. This information is passed to DMS in the HDFS Shuffle step of the DSQL (distributed SQL) plan along with other information needed to read the file, including the file’s path, the location of the appropriate Namenode, and the name of the RecordReader that the bridge should use.

The system attempts to evenly balance the number of bytes read by each DMS instance. Once the DMS instances obtain split information from the Namenode, each can independently read the portion of the file it is assigned, directly communicating with the appropriate Datanodes without any centralized control.

This is a very clear paper that lays out the core constructs of Polybase. Highly recommended.
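The byte-range assignment in the quoted passage (an offset and length per DMS instance, balanced as evenly as possible) can be sketched roughly as below. This is only an illustration, not Polybase's actual logic: the function name `compute_splits` and its parameters are hypothetical, and the sketch ignores details the real engine would have to handle, such as HDFS block boundaries, record boundaries, and multiple input files.

```python
def compute_splits(file_length: int, num_instances: int) -> list[tuple[int, int]]:
    """Divide a file of file_length bytes into contiguous (offset, length)
    ranges, one per reader instance, differing by at most one byte.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if file_length < 0:
        raise ValueError("file_length must be non-negative")
    if num_instances <= 0:
        raise ValueError("num_instances must be positive")

    # Every instance gets `base` bytes; the first `extra` instances get one more.
    base, extra = divmod(file_length, num_instances)

    splits = []
    offset = 0
    for i in range(num_instances):
        length = base + (1 if i < extra else 0)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # A 10-byte file shared by 3 readers.
    print(compute_splits(10, 3))
```

Each reader can then fetch its own range independently, which matches the paper's point that no centralized control is needed once the splits are assigned.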

Dijkstra’s Algorithm

One of the most important graph algorithms is Dijkstra’s Algorithm. Melissa Yan has a nice presentation on it, which I recommend reading before the paper itself. The paper is only a couple of pages long.
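For readers who want to see the idea in code first, here is a minimal sketch of Dijkstra's shortest-path algorithm. It uses a binary heap as the priority queue, which is a common modern variant rather than the formulation in the original paper, and it assumes all edge weights are non-negative. The adjacency-dictionary graph format and the function name `dijkstra` are choices made for this example.

```python
import heapq


def dijkstra(graph: dict, source) -> dict:
    """Return the shortest distance from source to every reachable node.

    graph maps each node to a dict of {neighbor: edge_weight}.
    Edge weights are assumed to be non-negative.
    """
    dist = {source: 0}
    heap = [(0, source)]  # entries are (distance so far, node)

    while heap:
        d, u = heapq.heappop(heap)
        # Skip stale entries left behind after a shorter path was found.
        if d > dist.get(u, float("inf")):
            continue
        for v, weight in graph.get(u, {}).items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


if __name__ == "__main__":
    example = {
        "a": {"b": 1, "c": 4},
        "b": {"c": 2, "d": 6},
        "c": {"d": 3},
        "d": {},
    }
    print(dijkstra(example, "a"))
```

In the example graph, the direct edge from `a` to `c` costs 4, but going through `b` costs 3, so the algorithm settles on the cheaper route.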

No Curation Today

Kevin Feasel

2017-01-02

Meta

Today is New Year’s Day observed, so instead of links to blog posts, a couple of links to academic papers are coming up.
