Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

Related Posts

Comparing Data Lake Job Runs

Yanan Cai shows how to compare stats on different executions of a job: Troubleshooting issues in recurring job is a time-consuming task. It starts with searching through the Job Browser to find instances of a recurring job and identifying both baseline and anomalous performance. This is followed by multi-way comparisons between job instances to figure out what […]

Read More

Data Lake Zones

Melissa Coates walks us through the different layers of a data lake: As we are approaching the end of 2017, many people have resolutions or goals for the new year. How about a goal to get organized…in your data lake? The most important aspect of organizing a data lake is optimal data retrieval. Click through for […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31