Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

Related Posts

Data Lake Organization Tips

Melissa Coates has some great advice for people working with data lakes: Q: Partitioning by date is common. Where should the dates go in the folder hierarchy? Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder […]

Read More

Choosing Azure Data Lake Analytics Versus Azure Databricks

Ginger Grant helps us make the decision between using Azure Data Lake Analytics and Azure Databricks: Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31