Category: Architecture

Soft Deletes in SQL Server

Published 2021-01-13 by Kevin Feasel

Erik Darling has some thoughts on soft deletes:

Implementing soft deletes for an app that’s been around for a while can be tough. In the same way as implementing Partitioning can be tough to add in later to get data management value from (rebuilding clustered indexes on the scheme, making sure all nonclustered indexes are aligned, and all future indexes are too, and making sure you have sufficient partitions at the beginning and end for data movement).

Read the whole thing. Incidentally, this also ties well into a recent post by Erik about deleting into a different table. It can be easier to implement soft deletes as deleting from the current table and adding to an archive table. That gives you the benefits of keeping deleted data while not running into some of the problems Erik mentions. And if you want to undo a deletion? Delete from the archive table and insert back into the main table.

Comments closed

When to Use Event Sourcing

Published 2021-01-07 by Kevin Feasel

Vikas Hazrati takes us through the pros and cons of using event sourcing for a project:

You would always get a ton of literature on Event Sourcing and CQRS. The key question is WHEN do you use it? Under what circumstances? Is your problem really in need of ES?
I would not go into the details of what Event Sourcing and CQRS is. The industry stalwarts have covered that in adequate detail. This post delves into battle-tested scenarios on where we should have used and otherwise ignored ES.

Click through for an analysis of pros and cons, as well as some advice on what it all means.

Comments closed

Power BI End-to-End Diagram Update

Published 2021-01-05 by Kevin Feasel

Melissa Coates has an update for us:

An updated version of the Power BI End-to-End diagram is now available as of January 4, 2021.

This update is as of January 4th and includes things like Azure Purview. Melissa also shows how it has evolved over time, which is particularly interesting for Power BI given its rate of change.

Comments closed

Moving Away from the Lambda Architecture

Published 2020-12-11 by Kevin Feasel

Xiang Zhang and Jingyu Zhu talk about migrating a project away from the Lambda architecture:

The Lambda architecture has become a popular architectural style that promises both speed and accuracy in data processing by using a hybrid approach of both batch processing and stream processing methods. But it also has some drawbacks, such as complexity and additional development/operational overheads. One of our features for Premium members on LinkedIn, Who Viewed Your Profile (WVYP), relied on a Lambda architecture for some time. The backend system supporting this feature had gone through a few architectural iterations in the past years: it started as a Kafka client processing a single Kafka topic, and eventually evolved to a Lambda architecture with more complicated processing logic. However, in an effort to pursue faster product iteration and lower operational overheads, we recently underwent a transition to make it Lambda-less. In this blog post, we’ll share some of the lessons learned in operating this system in the Lambda architecture, the decisions made in transitioning to Lambda-less, and the shifts necessary to undergo this transition.

When Lambda was first proposed back in 2015, it was intended as a compromise architecture trying to solve several important problems with the tools available in 2015 (well, 2013 and 2014—it was in a book, after all). I could definitely see the architecture fall into disuse within the next decade, not because it was at all bad, but because the world around it changed to the point that there is a better compromise available.

Comments closed

Durable Keys in Type 2 Dimensions

Published 2020-12-02 by Kevin Feasel

Martin Schoombee takes us through the idea of durable keys:

Also called an immutable or persisted key (I like durable better), a durable key is nothing more than a surrogate key (i.e. integer value or nonsensical number) used to identify a dimension member (company, employee, etc.) uniquely in a type-2 dimension. Confusing enough? It’s easier to explain with an example…

When I read Martin’s post, I kind of got it but said to myself, “How would I run this type of query more efficiently?” The thing that wasn’t clicking came from another article on the topic: you add the durable key to the fact as well as the current key. That way, you can join back to the Company dimension on CompanyKey if you want to get the company data as of the fact date, or you can join on DurableCompanyKey (and CurrentRecord = 1) to get the latest company data regardless of the fact date. Now that this is clear, I like the strategy a lot.

1 Comment

Delta Versus Lambda Architectures

Published 2020-11-23 by Kevin Feasel

Hector Leano compares the delta and lambda architectures:

Generally, a simple data architecture is preferable to a complex one. Code complexity increases points of failure, requires more compute to run jobs, adds latency, and increases the need for support. As a result, data pipeline performance degrades over time, increasing costs while decreasing productivity as your data engineers spend more time troubleshooting and downstream users wait longer for data refreshes.
Complexity was perceived as a necessary evil for the automated data pipelines feeding business reporting, SQL analytics, and data science because the traditional approach for bringing together batch and streaming data required a lambda architecture. While a lambda architecture can handle large volumes of batch and streaming data, it increases complexity by requiring different code bases for batch and streaming, along with its tendency to cause data loss and corruption. In response to these data reliability issues, the traditional data pipeline architecture adds even more complexity by adding steps like validation, reprocessing for job failures, and manual update & merge.

On the one hand, lambda was always intended to be a compromise architecture based on the tools of the time. On the other hand, take this with as many grains of salt as you need given that the post comes from the primary company responsible for delta.

Comments closed

Scaling ksqlDB, with Animations

Published 2020-11-23 by Kevin Feasel

Michael Drogalis walks us through scaling models with ksqlDB:

Software engineering memes are in vogue, and nothing is more fashionable than joking about how complicated distributed systems can be. Despite the ribbing, many people adopt them. Why? Distributed systems give us two things their single node counterparts cannot: scale and fault tolerance.
ksqlDB, the event streaming database, is built with a client/server architecture. You can run it with a single server, or you can cluster many servers together. Part 1 and part 2 in this series explained how a single server executes stateless and stateful operations. This post is about how these work when ksqlDB is deployed with many servers, and more importantly how it linearly scales the work it is performing—even in the presence of faults.
If you like, you can follow along by executing the example code yourself. ksqlDB’s quickstart makes it easy to get up and running.

Click through for well-animated examples.

Comments closed

PASS Summit Q&A: The Curated Data Platform

Published 2020-11-17 by Kevin Feasel

I answer some questions:

On Thursday, I presented a session at PASS Summit entitled The Curated Data Platform. You can grab slides and links to additional information on my website. Thank you to everyone who attended the session.
During and after the session, I had a few questions come in from the audience, and I wanted to cover them here.

Most of the questions were around document databases, so check them out.

Comments closed

Tips for Moving to a Multi-Tenant Setup

Published 2020-10-09 by Kevin Feasel

Adrian Hills continues a series on multi-tenant SQL Server:

What you knew a few years ago might differ significantly from the reality today, whether you started off with a single tenant system that you pivoted quickly to support multiple tenants, or you envisioned 10s of tenants and ended up with 1000s. Whatever that reality is, when you experience pain points around the 3 considerations I covered in part 1 of this series (security, maintainability, and scalability), it can lead to a need to change the multi-tenancy approach you’re using. Often, the biggest driver for change is around performance and scalability and typically tends to be related to a need to move from a less-isolated multi-tenancy approach (single database) to a more-isolated approach that supports the scaling out of workloads (multiple databases).
In the steps below, I’ll cover the general path you can follow to successfully make an architectural change like this to the database layer. As a provider of database performance monitoring and DataOps tools, SentryOne offers some tools that can come in handy along the way, so I’ll call those out as we go.

Click through for guidance.

Comments closed

Multi-Tenant Database Designs

Published 2020-09-28 by Kevin Feasel

Adrian Hills walks us through four database designs for multi-tenant data storage:

In my previous blog post, I talked about some of the key considerations around designing a multi-tenant system using SQL Server. There are several ways to implement multi-tenancy, and, as is often the case, there is no single “best” way but rather a range of options that each offer different trade-offs. The approach that is right for you depends on your objectives and needs for your specific environment. It’s important to consider which of these approaches best suit your requirements and goals based on the 3 core considerations from Multi-Tenancy with SQL Server, Part 1: security, maintainability (manageability), and scalability.
The following are the 4 approaches I will cover in this blog post:
1. Single database, shared schema
2. Single database, separate schema
3. Database per tenant
4. Multiple databases, multiple tenants per database, shared schema

I’ve worked with options 1, 3, and 4. Read on for Adrian’s thoughts. Ceteris paribus, my preference is 3. That said, I’ve worked in a situation where I migrated from 3 to 1 because there were thousands of customers, none of whom had more than hundreds of megabytes worth of data. 4 provides a good balance in that fashion, where you can bunch up smaller clients and give larger clients their own databases (and sometimes even servers). But if you’re going with options 2, 3, and 4, you probably want a central data warehouse which collects data across all four for internal use.

Comments closed