Data Lake Zones

Shannon Lowder walks us through a multi-zone approach to storing data in a data lake:

Our first zone is the raw zone.  This zone will serve as the landing point for source files.  Like the extract (or stage) schema in our data warehouse, we want these files to match the source system as close as possible.In the data lake, we actually go one step beyond saying we want the schema of our raw files to match the source system, we also want these files to be immutable.

Immutable means once they are written to the raw folder we shouldn’t be able to modify or delete them.  That way, we can always reconstruct different states from these files without having to retrieve them from the source system.

Worth reading the whole thing.

Related Posts

Defining A Data Lake

Derik Hammer gives us a definition of the data lake: Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, is used to describe a data store which can scale to extremely large sizes, in an affordable manner. A data lake is also designed to store the raw data, in […]

Read More

Comparing Data Lake Job Runs

Yanan Cai shows how to compare stats on different executions of a job: Troubleshooting issues in recurring job is a time-consuming task. It starts with searching through the Job Browser to find instances of a recurring job and identifying both baseline and anomalous performance. This is followed by multi-way comparisons between job instances to figure out what […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031