Press "Enter" to skip to content

The Data Lake From 10,000 Feet

Pradeep Menon has a high-level explanation of what a data lake is and how it differs from traditional data warehouses:

With the changes in the data paradigm, a new architectural pattern has emerged. It’s called as the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Like the lake, it caters to need to different people, those who want to fish or those who want to take a boat ride or those who want to get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.

The way I compare a data lake to a data warehouse or a mart is like this:

Data Lake stores data in the purest form caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.

One way of thinking about this is that data warehouses are great for solving known business questions:  generating 10K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs.  By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but which the warehouse never seems to have quite the right set of information.