Thinking About The Data Lake

Ust Oldfield gives architectural hints on Azure Data Lake Store:

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

This is how you get data swamps (a term which I’m so happy is catching on).  Read the whole thing.

Related Posts

Data Lake Archive Tier

Ust Oldfeld looks at an important part of a data lake: The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake. Where Hot and Cool access tiers can be applied at a storage account level, […]

Read More

Fetching U-SQL Job Input And Output Paths

Matthew Hicks shows how to retrieve information on U-SQL input and output paths using Powershell: Each time you submit a U-SQL job, a job folder is created in your Azure Data Lake Store account. This folder contains useful debugging information about the job, including a file called the U-SQL algebra file. This is an XML […]

Read More

Categories

November 2016
MTWTFSS
« Oct Dec »
 123456
78910111213
14151617181920
21222324252627
282930