Press "Enter" to skip to content

Raw Data in the Data Lake

Steve Cardella uses wrestling as a metaphor where I would have used sewage:

Raw. Unfiltered. Data. The raw zone – it’s the dark underbelly of your data lake, where anything can happen. The CRM data just body-slammed the accounting data, while the HR data is taking a chair to the marketing data. It’s all a rumble for the championship belt, right? Oh, wait – we’re talking data lakes. Sorry. If the raw zone isn’t where data goes to duke it out, then what is the raw zone of a data lake? How should it be set up?

First, let’s take a time-out to give some context. A data lake is a central storage pool for enterprise data; we pour information into it from all kinds of sources. Those sources might include anything from databases to raw audio and video footage, in unstructured, semi-structured, and structured formats. A data warehouse, conversely, only houses structured data. The data lake is divided into one or more zones of data, with varying degrees of transformation and cleanliness (see this video for more: Data Lake Zones, Topology, and Security). The raw zone is the foundation upon which all other data lake zones are built.

Read on to understand the importance of raw data in a data lake, and the equal importance of making sure end users don’t see that stuff very often. Also, Steve gets bonus points for using my favorite term for the Aristotelian opposite of a data lake: the data swamp.