Lambda is an organic result of the limitations of existing tools. Distributed systems architects and developers commonly criticize its complexity – and rightly so. Those of us that have worked extensively in Extract-Transform-Load and symmetric multiprocessing systems see red flags when code is replicated in multiple services. Ensuring data quality and code conformity across multiple systems, whether massively parallel processing (MPP) or symmetrically parallel system (SMP), has the same best practice: the least amount of times you reproduce code is always the correct number of times.
We reproduce code in lambda because different services in MPP systems are better at different tasks. The maturity of tools historically hasn’t allowed us to process streams and batch in a single tool. This is starting to change, with Apache Spark emerging as a single preferred compute service for stream and batch querying, hence the timing of Azure Databricks. However, on the storage side, what was meant to be an immutable store that is the data lake in practice, can become the dreaded swamp when governance or testing fails; which is not uncommon. A fundamentally different assumption to how we process data is required to combat this degradation. Enter: the kappa architecture, which we’ll examine in the next post of this series.