Data Lake Zones

Shannon Lowder walks us through a multi-zone approach to storing data in a data lake:

Our first zone is the raw zone.  This zone will serve as the landing point for source files.  Like the extract (or stage) schema in our data warehouse, we want these files to match the source system as close as possible.In the data lake, we actually go one step beyond saying we want the schema of our raw files to match the source system, we also want these files to be immutable.

Immutable means once they are written to the raw folder we shouldn’t be able to modify or delete them.  That way, we can always reconstruct different states from these files without having to retrieve them from the source system.

Worth reading the whole thing.

Related Posts

Automating Azure Data Lake Storage ACLs

Shannon Lowder shows how to automate Azure Data Lake Storage access control lists: Now that you have these, you can use a for each loop to set your permissions. foreach ($ACL in $ACLs) { write-host "Grant $useremail " $ACL[1] " access to " $ACL[0]; Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adls -Path $ACL[0] -AceType User -Id $(Get-AzureRmADUser -Mail $useremail […]

Read More

Data Lakes Aren’t New

Shannon Lowder reveals one of the deep, dark data lake secrets: Turns out there are three basic zones or areas to a data lake. Raw, Managed, and Presentation. The raw zone should be optimized for fast storage.  The goal is to get the data in as quickly as possible.  Don’t make any changes to this […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031