Defining A Data Lake

Derik Hammer gives us a definition of the data lake:

Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, is used to describe a data store which can scale to extremely large sizes, in an affordable manner. A data lake is also designed to store the raw data, in its original format, so it can be used immediately, rather than waiting weeks for the IT department to massage it into a format that the data warehouse can accept and/or use effectively.

The data lake concept always includes the capability to scale to an enormous size. However, you do not need petabytes of data to find use in a data lake. It can be used as cheap storage for long-term archival data. It can be used to transform data before attempting to ingest into a data warehouse with the convenience of retaining the original and transformed versions of the data. It also can be used as the centralized staging location for ingestion into the data warehouse, simplifying the loading processes.
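To make the "retain the original and transformed versions" idea concrete, here is a minimal sketch of my own, not something from Derik's post. It uses a local folder as a stand-in for the lake; the `raw` and `transformed` zone names, the `sales.csv` file, and its `customer` and `amount` columns are all assumptions made up for illustration.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def land_raw(source: Path, lake_root: Path) -> Path:
    """Copy the source file into the (hypothetical) raw zone unchanged."""
    raw_path = lake_root / "raw" / source.name
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, raw_path)
    return raw_path


def transform(raw_path: Path, lake_root: Path) -> Path:
    """Write a cleaned copy to the transformed zone; the raw file is not modified."""
    out_path = lake_root / "transformed" / raw_path.name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["customer", "amount"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                # Normalize names and amounts so a warehouse load is simpler.
                "customer": row["customer"].strip().title(),
                "amount": f"{float(row['amount']):.2f}",
            })
    return out_path


def demo() -> tuple[str, str]:
    """Land a small made-up file, transform it, and return both versions."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "sales.csv"
        source.write_text("customer,amount\n  alice ,10\nBOB,2.5\n")
        lake = tmp_path / "lake"
        raw = land_raw(source, lake)
        curated = transform(raw, lake)
        return raw.read_text(), curated.read_text()


if __name__ == "__main__":
    raw_text, transformed_text = demo()
    print(raw_text)
    print(transformed_text)
```

The point of the two zones is that the warehouse load reads only from `transformed`, while `raw` keeps the file exactly as it arrived, so a transformation bug can be fixed and re-run without going back to the source system. A real lake would sit on object storage rather than a local folder, but the layout idea carries over.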

I would like to take this opportunity to remind readers that the Aristotelian opposite of the data lake is the data swamp. Derik uses this term as well, and it makes me feel warm and fuzzy inside to see it broadly adopted.

Related Posts

Data Lake Organization Tips

Melissa Coates has some great advice for people working with data lakes: Q: Partitioning by date is common. Where should the dates go in the folder hierarchy? Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder […]


Choosing Azure Data Lake Analytics Versus Azure Databricks

Ginger Grant helps us make the decision between using Azure Data Lake Analytics and Azure Databricks: Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had […]
