Press "Enter" to skip to content

The Flexible Data Lake

Neil Stokes explains how you can optimize a Hadoop-based data lake:

There are many details, of course, but these trade-offs boil down to three facets as shown below.

Big refers to the volume of data you can handle with your environment. Hadoop allows you to scale your storage capacity – horizontally as well as vertically – to handle vast volumes of data.

Fast refers to the speed with which you can ingest and process the data and derive insights from it. Hadoop allows you to scale your processing capacity using relatively cheap commodity hardware and massively parallel processing techniques to access and process data quickly.

Cheap refers to the overall cost of the platform. This means not just the cost of the infrastructure to support your storage and processing requirements, but also the cost of building, maintaining, and operating the environment, which can grow quite complicated as more requirements come into play.

The bottom line here is that there’s no magic in Hadoop. As with any other technology, you can achieve one, or at best two, of these facets; in the absence of an unlimited budget, you typically need to sacrifice somewhere.

Software development is full of trade-offs, and data lakes are no different. Read the whole thing.