Thinking About The Data Lake

Ust Oldfield gives architectural hints on Azure Data Lake Store:

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

This is how you get data swamps (a term which I’m so happy is catching on).  Read the whole thing.

Related Posts

Data Lake Permissions

Melissa Coates has started a multi-part series on Azure Data Lake permissions.  She’s put up the first three parts already.  Part 1 covers the types of permissions available as well as some official documentation: (1) RBAC permissions to the ADLS account itself, for the purpose of managing the resource. RBAC = Role-based access control. RBAC are […]

Read More

Defining A Data Lake

Derik Hammer gives us a definition of the data lake: Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, is used to describe a data store which can scale to extremely large sizes, in an affordable manner. A data lake is also designed to store the raw data, in […]

Read More


November 2016
« Oct Dec »