Architecting Semi-Structured Data Solutions

James Serra gives four architectural scenarios for handling large quantities of semi-structured data:

An evolution of the three previous scenarios that provides multiple options for the various technologies.  Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control.  ELT is usually used instead of ETL (see Difference between ETL and ELT).  The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.

Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.

Check it out for a high-level architectural view of contemporary warehousing choices.  I prefer having both systems in play:  the EDW answers known business questions and gives you back report information relatively quickly; whereas the Hadoop cluster allows you to do spelunking, data cleansing, and answer unanticipated business questions.

Related Posts

Event Sourcing On Kafka

Adam Warski shows how you can use Apache Kafka as your event sourcing data source: There’s a number of great introductory articles, so this is going to be a very brief introduction. With event sourcing, instead of storing the “current” state of the entities that are used in our system, we store a stream of events that relate to these […]

Read More

The Basics Of Kafka Security

Stephane Maarek has a nice post covering some of the basics of securing an Apache Kafka cluster: Once your Kafka clients are authenticated, Kafka needs to be able to decide what they can and cannot do. This is where Authorization comes in, controlled by Access Control Lists (ACL). ACL are what you expect them to be: […]

Read More


May 2016
« Apr Jun »