Thinking About The Data Lake

Ust Oldfield gives architectural hints on Azure Data Lake Store:

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

This is how you get data swamps (a term which I’m so happy is catching on).  Read the whole thing.

Related Posts

Azure Data Lake Analytics Pipelines

Yan Li notes that Azure Data Lake Analytics now offers the ability to manage pipelines: To make it easier to manage and understand jobs, ADLA now captures the pipeline and recurrence information for each job. This information can be used to connect and organize jobs belonging to the same pipeline or recurring instances. As shown in Fig […]

Read More

Generating U-SQL Extract Scripts From Visual Studio

Yanan Cai shows a GUI for creating U-SQL EXTRACT scripts via Azure Data Lake Tools for Visual Studio: One of U-SQL’s core capabilities is to be able to schematize unstructured data on the fly without having to create a metadata object for it. This capability is provided by the EXTRACT expression that will invoke either […]

Read More

Categories

November 2016
MTWTFSS
« Oct Dec »
 123456
78910111213
14151617181920
21222324252627
282930