Data Lakes And Data Swamps

Randolph West talks about data lakes:

Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!, Outlook.com) are managing data stores measured in petabytes. On a daily basis these organizations handle all sorts of structured and unstructured data.

Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools, and even created new technologies, to manage data of this magnitude in a field called big data.

The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size, but Michael J. Swart warns us that if you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.

Incidentally, I’ll note that the term data swamp has a storied history here at Curated SQL.
