Randolph West talks about data lakes:
Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!, Outlook.com) are managing data stores measured in petabytes. On a daily basis these organizations handle all sorts of structured and unstructured data.
Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools, and even created new technologies, to manage data of this magnitude in a field called big data.
The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size, but Michael J. Swart warns us: if you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.
Incidentally, I’ll note that the term data swamp has a storied history here at Curated SQL.
Kaushik Nath explains what a Delta Lake is and why it is beneficial:
Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right?
All this hype around data lakes has ignored their inherent drawbacks and limitations. Well, I’m Not Here to create a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into the data lake investment with eyes wide open. Otherwise it might lead to some serious complications.
Delta Lake is a concept intended to mitigate some of the issues with data lakes in general, turning them into data swamps.
Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake:
As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”
Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.
Read the whole thing.
Derik Hammer gives us a definition of the data lake:
Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, is used to describe a data store which can scale to extremely large sizes, in an affordable manner. A data lake is also designed to store the raw data, in its original format, so it can be used immediately, rather than waiting weeks for the IT department to massage it into a format that the data warehouse can accept and/or use effectively.
The data lake concept always includes the capability to scale to an enormous size. However, you do not need petabytes of data to find use in a data lake. It can be used as cheap storage for long-term archival data. It can be used to transform data before attempting to ingest into a data warehouse with the convenience of retaining the original and transformed versions of the data. It also can be used as the centralized staging location for ingestion into the data warehouse, simplifying the loading processes.
I would like to take this opportunity to remind readers that the Aristotelian opposite of the Data Lake is the Data Swamp. Derik uses this term as well and it makes me feel warm and fuzzy inside to see broad adoption of this term.
Ust Oldfield gives architectural hints on Azure Data Lake Store:
It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.
This is how you get data swamps (a term which I’m so happy is catching on). Read the whole thing.
Jen Stirrup has a great primer on data lakes and factors to consider before you jump into the idea:
The organization will need to take a step back to understand better their existing status. Are they just starting out? Are other departments which are doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands their state better, they can start to broadly work out the strategy that the Data Lake is intended to provide.
As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.
I would like to popularize the term Data Swamp for “that place you store a whole bunch of data of dubious origin and value.” It’s the place that you promise management of course you can get the data back…as long as they never actually ask for it or are okay with reading terabytes of flat files from backup tapes. The Data Swamp is the Aristotelian counterpart to the Data Lake, Goofus to its Gallant. It will also, to my estimate, be the more common version.
James Serra explains why you might want to use a data lake:
To refresh, a data lake is a landing zone, usually in Hadoop, for disparate sources of data in their native format. Data is not structured or governed on its way into the data lake. This eliminates the upfront costs of data ingestion, especially transformation. Once data is in the lake, the data is available to everyone. You don’t need a priority understanding of how data is related when it is ingested, rather, it relies on the end-user to define those relationships as they consume it. Data governorship happens on the way out instead of on the way in. This makes a data lake very efficient in processing huge volumes of data. Another benefit is the data lake allows for data exploration and discovery, to find out if data is useful or to create a one-time report.
I’m still working on a “data swamp” metaphor, in which people toss their used mattresses and we expect to get something valuable if only we dredge a little more. Nevertheless, read James’s article; data lakes are going to move from novel to normal over the next few years.
Can't find what you're looking for? Try refining your search: