Press "Enter" to skip to content

Working with ADLS Gen 2 in Power Query

Marco Russo takes us through some ways to optimize performance when working with Azure Data Lake Storage Gen 2 from Power Query:

With Power Query you can apply filters to the list obtained by the File System View option, thus restricting the access to only those files (or a single file) you are interested in. However, there is no query folding of this filter. What happens is that every time you refresh the data source, the list of all these files is read by Power Query; the filters in M Query to the folder path and the file name are then applied to this list only client-side. This is very expensive because the entire list is also downloaded when the expression is initially evaluated just to get the structure of the result of the transformation.

A better way to manage the process is to specify in the URL the complete folder path to traverse the hierarchy, and get only the files that are interesting for the transformation – or the exact path of the file is you are able to do that. For example, the data lake I used had one file for each day, stored in a folder structure organized by yyyy\mm, so every folder holds up to 31 files (one month).

Read on for more advice in this vein.