Michael Verrilli has a post contrasting the Lambda and Kappa data architectures:
Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up to the moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.
One additional benefit to this architecture is that you can replay the same incoming data and produce new views in case code or formula changes.
The biggest detraction to this architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and speed layers. Luckily with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.
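To make the excerpt concrete, here is a minimal Python sketch of the Lambda pattern it describes: a query that merges a batch view with a real-time view, the real-time data expiring once a batch run catches up, and a replay of the same log under a changed formula. Everything in it (the `EVENTS` log, `build_view`, `query`, `BATCH_CUTOFF`, and the page-count example) is a hypothetical illustration, not code from Verrilli's post, and in-memory counters stand in for what would be separate batch and streaming systems.

```python
from collections import Counter

# Hypothetical immutable event log: (timestamp, page) pairs.
# Keeping the raw log is what makes replay possible.
EVENTS = [
    (1, "home"), (2, "home"), (3, "about"),
    (4, "home"), (5, "about"),
]

def build_view(events, rule=lambda page: 1):
    """Aggregate events into a view; `rule` is the per-event formula."""
    view = Counter()
    for _, page in events:
        view[page] += rule(page)
    return view

def query(page, batch_view, realtime_view):
    """Serving-layer read: batch result plus what the speed layer has seen since."""
    return batch_view[page] + realtime_view[page]

# Assumption for this sketch: the last batch run covered timestamps <= 3.
BATCH_CUTOFF = 3

batch_view = build_view(e for e in EVENTS if e[0] <= BATCH_CUTOFF)
realtime_view = build_view(e for e in EVENTS if e[0] > BATCH_CUTOFF)
home_total = query("home", batch_view, realtime_view)

# The next batch run catches up: real-time entries expire and the
# recomputed batch view replaces them.
batch_view = build_view(EVENTS)
realtime_view = Counter()
home_after_batch = query("home", batch_view, realtime_view)

# Formula change: replay the same log with a new rule to produce a new view.
replayed_view = build_view(EVENTS, rule=lambda page: 2 if page == "home" else 1)
```

The query answer for `"home"` is the same before and after the batch run catches up; only the layer serving each event changes. The sketch also hints at the maintenance cost mentioned above: in a real deployment the batch and speed paths are typically separate code bases that must both implement `rule` consistently.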
I haven’t seen much on the topic of Big Data architectures this year; it seems like it was a much more popular topic last year.