Press "Enter" to skip to content

Category: Architecture

Defining Technical Debt

Louis Davidson takes on a favorite phrase of many an IT person:

Ah, the term “technical debt.” The implication of it is that you have this wonderful idea, this glorious design, and for time/money reasons, you said “we can’t achieve this.” I am not sure there has ever been a project that didn’t have technical debt. It happens in software, it happens in the real world. You probably have technical debt in your house, and huge companies like Disney make these glorious plans that never quite get finished.

Click through for a link to Louis’s video. As for my own definition of technical debt, I wrote a blog post about it a while back. As of this moment, the only part I might debate myself on is whether “It was a good decision at the time, but times have changed” is really technical debt or if it’s something else. From an ontological perspective, it’s probably a different category of thing. But from the standpoint of a practitioner with a code base or infrastructure as it is, I don’t know that it matters all that much whether we call it “technical debt” or “the ever-changing sands of time ruining all that is great.” Though definitely pull out the latter in a meeting when trying to explain to a PM why you need 40 hours of dev time to rewrite some code.

Thoughts on the Future of MySQL

Dave Stokes shares some thoughts:

I am not intentionally trying to upset anyone with this blog post or minimize the efforts of many brilliant people whom I admire. However, I connected with several people over the 2025 holidays who all had the same question: What is the future of MySQL? At the upcoming FOSDEM conference, several events will discuss this subject and push a particular solution. And in several ways, they are all wrong.

Oracle has not been improving the community edition for a long time now. They have laid off many of their top performers in the MySQL group. We got almost a good decade and a half out of Oracle’s stewardship of the “world’s most popular database”, and we should be thankful for that. However, now that time is over, it is time to consider future options that will involve no updates, CVEs, or innovation for what is the MySQL Community Edition.

Read on for a few possibilities, focusing on the open-source database market.

BIGINT Serial Columns in PostgreSQL

Elizabeth Christensen lays out an argument:

Lots of us started with a Postgres database that incremented with an id SERIAL PRIMARY KEY. This was the Postgres standard for many years for data columns that auto incremented. The SERIAL is a shorthand for an integer data type that is automatically incremented. However as your data grows in size, SERIALs and INTs can run the risk of an integer overflow as they get closer to 2 Billion uses.

We covered a lot of this in a blog post The Integer at the End of the Universe: Integer Overflow in Postgres a few years ago. Since that was published we’ve helped a number of customers with this problem and I wanted to refresh the ideas and include some troubleshooting steps that can be helpful. I also think that BIGINT is more cost effective than folks realize.

Click through for Elizabeth’s argument. I’d say the story is very similar for SQL Server, where I’m more inclined to create a BIGINT identity column, especially because I almost automatically apply page-level compression to tables, so there’s not really a downside to doing so. Identity columns don’t have a domain, so there’s no domain-specific information like you’d get with a column such as Age; and with page-level compression, you’re not wasting space.
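
For a rough sense of what the modern alternative looks like in Postgres (a minimal sketch; the table names here are hypothetical), a bigint identity column sidesteps both the SERIAL shorthand and the 2-billion ceiling:

CREATE TABLE orders (
    -- bigint identity instead of SERIAL: no realistic risk of overflow
    order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ordered_at timestamptz NOT NULL DEFAULT now()
);

-- Widening the key on an existing table is a single statement, but it rewrites
-- the table under a heavy lock, so schedule it carefully
ALTER TABLE legacy_orders ALTER COLUMN order_id TYPE bigint;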

In Support of Ugly Code

John Cook defends (some) ugly code:

Ugly code may be very valuable, depending on why it’s ugly. I’m not saying that it’s good for code to be ugly, but that code that is already ugly may be valuable.

That something is ugly is typically a visceral reaction. But I try to tease out why I think code is ugly, as it can be ugly for several reasons:

  • It’s not formatted well or consistently. That’s an easy fix for the most part.
  • Naming is inconsistent or contradictory. Depending on the tooling, this is a reasonably easy fix.
  • The logic is convoluted to me. This is where things get tricky. Is it convoluted because I don’t understand what’s going on? Or is it convoluted because the person who developed or maintained it didn’t understand something important? If it’s the former, I try (“try” being the operative word here) to bite my tongue and dig in deeper to understand it better. But if it’s the latter, I think that’s fair game for refactoring.

Younger me was all about rewriting and removing nasty, inefficient, ugly code. But older me realizes that only some nasty, inefficient, ugly code is actually bad. I still will heartily argue that code is a liability and that most code bases could make do with a spring cleaning. But it has to come from a place of understanding first. I have a lot more on the topic of technical debt in an essay I wrote a few years ago. And I did purposefully cut myself off at one point to be cute.

Choosing a Vector Database

Joe Sack has some advice:

Vector search has become a standard approach for semantic search and RAG. Whether you’re evaluating a dedicated vector database, SQL Server 2025, a Postgres extension like pgvector, or an in-memory library, there are certain production realities worth planning for.

Admittedly, my vector database decision boiled down to “What can I actually get to work in my non-internet-connected on-premises environment where everything is locked down to the point that bringing in new software is a major hassle?” That quickly narrowed down the set of viable options.
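
For reference, the pgvector route from Joe’s list looks roughly like this (a minimal sketch with a hypothetical table and a toy three-dimensional vector; real embeddings run to hundreds or thousands of dimensions):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body text NOT NULL,
    embedding vector(3) NOT NULL
);

-- Nearest neighbors by Euclidean distance; pgvector also offers <=> for cosine distance
SELECT id, body
FROM documents
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 5;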

Reverse Engineering a Physical Model Diagram with Redgate Data Modeler

Steve Jones gives the new Redgate acquisition a try:

I recently wrote about a logical diagram with Redgate Data Modeler. That was interesting, but creating all the objects is a pain. I decided to try creating a physical diagram from an existing database. This post looks at the experience.

Click through for Steve’s thoughts. I appreciate how he’s willing to call out the pain points that exist in the product today.

Common Star Schema Mistakes

Ben Richardson gets back to basics:

Sometimes the culprit isn’t actually your DAX, it’s your data model.

Star schema mistakes are incredibly common in Power BI, and really hard to track down.

When your data model isn’t a clean star schema, you end up with broken filters, confusing relationships and slow visuals.

It’s important to get it right from the start! So we broke down the top 10 most common mistakes people make, how to identify them and how to fix them!

This is where reviewing (or reading) Ralph Kimball’s Data Warehouse Toolkit can save you a lot of time and stress. The Microsoft data analytics world is predicated so heavily on Kimball-style dimensional modeling that the choice tends to be either building a proper star schema up front or spending processing and developer time trying to fix it in post-production using DAX or trickery.
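
To make the baseline concrete, here is a minimal star schema sketch (table and column names are hypothetical): descriptive attributes live in dimension tables, and the fact table stays narrow, holding only keys and measures.

CREATE TABLE dim_product (
    product_key bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    product_number text NOT NULL,
    product_name text NOT NULL,
    category text NOT NULL
);

CREATE TABLE fact_sales (
    date_key int NOT NULL,    -- would reference a date dimension (not shown)
    product_key bigint NOT NULL REFERENCES dim_product (product_key),
    sale_unit_price numeric(10, 2) NOT NULL,
    sale_units int NOT NULL
);

Because every filterable attribute lives in a dimension, relationships stay one-to-many and single-direction, which is the shape Power BI handles best.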

Idempotence and Durable Execution

Jack Vanlightly does some thinking:

Determinism is a key concept to understand when writing code using durable execution frameworks such as Temporal, Restate, DBOS, and Resonate. If you read the docs you see that some parts of your code must be deterministic while other parts do not have to be. This can be confusing to a developer new to these frameworks.

This post explains why determinism is important and where it is needed and where it is not. Hopefully, you’ll have a better mental model that makes things less confusing.

Some of the examples Jack includes are pretty tricky, showing just how difficult it can be to ensure that multiple, independent systems are all on the same page.
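
The frameworks Jack covers handle this at the workflow layer, but the underlying idea of making retried steps harmless is easy to sketch in database terms. As a loose, hypothetical illustration of an idempotent side effect: a replayed step re-sends the same key, and the second insert becomes a no-op rather than a duplicate payment.

CREATE TABLE payments (
    idempotency_key uuid PRIMARY KEY,
    customer_id bigint NOT NULL,
    amount numeric(10, 2) NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Safe to retry: the primary key on idempotency_key turns a replay into a no-op
INSERT INTO payments (idempotency_key, customer_id, amount)
VALUES ('8f14e45f-ceea-467f-a0e6-7f1c4d1b2a3c', 42, 19.99)
ON CONFLICT (idempotency_key) DO NOTHING;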

Tips for Building a Data Warehouse

James Serra gets back to foundations:

I had a great question asked of me the other day and thought I would turn the answer into a blog post. The question is “I’m an experienced DBA in SQL Server/SQL DB, and my company is looking to build their first data warehouse using Microsoft Fabric. What are the best resources to learn how to do your first data warehouse project?”. So, below are my favorite books, videos, blogs, and learning modules to help answer that question:

Click through for James’s recommendations. I strongly agree with his advice to start with Ralph Kimball’s The Data Warehouse Toolkit, and frankly, I think a lot of James’s advice here is sound. The person asking focuses on Fabric, and there are plenty of Fabric-specific things to learn, but at the end of the day, modern data warehouses are still data warehouses.

Sales Tables and Normalization

Hans-Jürgen Schönig lays out an argument:

When creating data models in PostgreSQL (or any other powerful relational database), several common problems can arise that hinder effective database design. One frequent issue is the incorrect normalization of tables, leading to redundant data, inefficient querying capabilities, or even simply plain wrong semantics. 

First, I agree with the argument Hans-Jürgen is making regarding modeling sales data: it is important to understand what information the company is actually collecting and the domain and business rules associated with that data collection. The real world should shape and define your database model, not the other way around.

But where I fervently disagree is that this is “too much” or “over-optimistic” normalization, because this is not normalization. Normalization is the lossless decomposition of data across entities, and normal forms 1 through 5 break down along two major lines: decomposing functional dependencies (Boyce-Codd Normal Form) and decomposing join dependencies (4th and 5th normal forms). In the example, there is clearly a loss of information from the wider model—the table definition Hans-Jürgen shows later on—so the split into product + sales tables does not adhere to 5th normal form.

The sales table does need some variant of product_unit_price because that was the price as of the time of sale, and prices can change for a variety of reasons: discounts because the product is on sale, adjustments due to inflation, increases in demand, supply chain issues, adjustments based on new marketing models, etc. For historical purposes, it is important that we capture the price as of the time of sale, as that will tell us how much to refund if the customer returns the product. It is also important for bookkeeping reasons, as a variety of taxes and reports depend upon that number.

But it’s also clear that the product table is reasonable: a product number (following the first rule of database architecture: anything labeled “number” is actually a string with non-numeric characters in it), a product name, and a product price. That price, however, is the current sales price, which is different from the price at the time of sale. Those are two separate attributes, and labeling them both “product_unit_price” leads to some of that confusion. They should both be in the data model.

Then, if I’m looking critically at the data model, I would ask why product name is in the sales data. Do product names change so frequently that we need to capture them on every sales record? Do managers have the authority to change the name of a product on the fly during a sale? Probably not, and unless managers do have that discretion or the product name changes on a per-sale basis, it doesn’t belong there. There could be a functional dependency from product_number to product_name, such that each unique value of product_number has one and only one product_name. In that case, the product table should have product_name and the sales table should not.

But what if the product name does change over time and we need to track those changes? Then I’d (potentially) relabel product_name on the product table to current_product_name and create a 6th normal form table called something like product_name_history. This table would include the product_number, the product_name, and a time range over which that name was valid for that product. These timeframes should not overlap and, as long as the product exists, there should be an entry that corresponds with the appropriate name at any arbitrary time.

My final model would look something like:

CREATE TABLE product (
    product_number text PRIMARY KEY,
    product_name text NOT NULL,
    product_price numeric(10, 2) NOT NULL
);

CREATE TABLE product_name_history (
    -- Primary key would be on the combo of product_number and begin_time/interval
    product_number text NOT NULL,
    product_name text NOT NULL,
    -- Architected as an interval, implemented as start and end timestamps
    begin_time timestamptz NOT NULL,
    end_time timestamptz NOT NULL
);

CREATE TABLE sales (
    sales_time timestamptz NOT NULL,
    product_number text NOT NULL,
    sale_unit_price numeric(10, 2) NOT NULL,
    sale_units numeric(10, 2) NOT NULL
);
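
As a quick sketch of how the history table would be used under this model, the name in effect at the time of each sale comes from a range join on the sale timestamp:

SELECT
    s.sales_time,
    s.product_number,
    h.product_name AS product_name_at_sale,
    s.sale_unit_price,
    s.sale_units
FROM sales s
    JOIN product_name_history h
        ON h.product_number = s.product_number
        AND s.sales_time >= h.begin_time
        AND s.sales_time < h.end_time;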

I did also rename “product_unit_price” and “product_units” to “sale_unit_price” and “sale_units” to make it clearer that these are associated with the sale itself. We could also get into things like sales orders versus sale order line items, additional attributes, etc., but this post is already long enough as it is. And again, this assumes that product names change frequently enough, and that the different names are relevant enough, for us to care about what the product name was at the time of a given sale. If not, we could remove the product_name_history table without otherwise changing the model.

Long story short, I agree with Hans-Jürgen that database design is important and understanding the business domain is critical for designing a good data model. But where I disagree is with the concept of over-normalization. That’s not a real thing. Improper attempts at normalization are certainly a real thing, as we see in this post, and it is important to understand the business rules as well as the rules of defining normal forms in order to prevent this.
