Why Data Lakes?

James Serra explains why you might want to use a data lake:

To refresh, a data lake is a landing zone, usually in Hadoop, for disparate sources of data in their native format.  Data is not structured or governed on its way into the data lake.  This eliminates the upfront costs of data ingestion, especially transformation.  Once data is in the lake, the data is available to everyone.  You don’t need a priority understanding of how data is related when it is ingested, rather, it relies on the end-user to define those relationships as they consume it.  Data governorship happens on the way out instead of on the way in.  This makes a data lake very efficient in processing huge volumes of data.  Another benefit is the data lake allows for data exploration and discovery, to find out if data is useful or to create a one-time report.

I’m still working on a “data swamp” metaphor, in which people toss their used mattresses and we expect to get something valuable if only we dredge a little more.  Nevertheless, read James’s article; data lakes are going to move from novel to normal over the next few years.

Related Posts

Long-Term Storage In Kafka

Jay Kreps shows us that you can use Kafka as a primary data store: The short answer is that it’s not insane, people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases, […]

Read More

Creating A Simple Kafka Streams Application

Bill Bejeck has built a simple Kafka Streams application for us: This blog post will quickly get you off the ground and show you how Kafka Streams works. We’re going to make a toy application that takes incoming messages and upper-cases the text of those messages, effectively yelling at anyone who reads the message. This […]

Read More

Categories

December 2015
MTWTFSS
« Nov Jan »
 123456
78910111213
14151617181920
21222324252627
28293031