Press "Enter" to skip to content

Data Lakes

Jen Stirrup has a great primer on data lakes and factors to consider before you jump into the idea:

The organization will need to take a step back to understand better their existing status. Are they just starting out? Are other departments which are doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands their state better, they can start to broadly work out the strategy that the Data Lake is intended to provide.

As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.

I would like to popularize the term Data Swamp for “that place you store a whole bunch of data of dubious origin and value.”  It’s the place that you promise management of course you can get the data back…as long as they never actually ask for it or are okay with reading terabytes of flat files from backup tapes.  The Data Swamp is the Aristotelian counterpart to the Data Lake, Goofus to its Gallant.  It will also, to my estimate, be the more common version.