

Idempotence and Durable Execution

Jack Vanlightly does some thinking:

Determinism is a key concept to understand when writing code using durable execution frameworks such as Temporal, Restate, DBOS, and Resonate. If you read the docs, you see that some parts of your code must be deterministic while other parts do not have to be. This can be confusing to a developer new to these frameworks.

This post explains why determinism is important and where it is needed and where it is not. Hopefully, you’ll have a better mental model that makes things less confusing.

Some of the examples Jack includes are pretty tricky, showing just how difficult it can be to ensure that multiple, independent systems are all on the same page.


Tips for Building a Data Warehouse

James Serra gets back to foundations:

I had a great question asked of me the other day and thought I would turn the answer into a blog post. The question is “I’m an experienced DBA in SQL Server/SQL DB, and my company is looking to build their first data warehouse using Microsoft Fabric. What are the best resources to learn how to do your first data warehouse project?”. So, below are my favorite books, videos, blogs, and learning modules to help answer that question:

Click through for James’s recommendations. I strongly agree with his advice to start with Ralph Kimball’s The Data Warehouse Toolkit, and frankly, I think a lot of James’s advice here is sound. The person asking focuses on Fabric, and there are plenty of Fabric-specific things to learn, but at the end of the day, modern data warehouses are still data warehouses.


Sales Tables and Normalization

Hans-Jürgen Schönig lays out an argument:

When creating data models in PostgreSQL (or any other powerful relational database), several common problems can arise that hinder effective database design. One frequent issue is the incorrect normalization of tables, leading to redundant data, inefficient querying capabilities, or even simply plain wrong semantics. 

First, I agree with the argument Hans-Jürgen is making regarding modeling sales data: it is important to understand what information the company is actually collecting and the domain and business rules associated with that data collection. The real world should shape and define your database model, not the other way around.

But where I fervently disagree is that this is “too much” or “over-optimistic” normalization because this is not normalization. Normalization is the lossless decomposition of data across entities, and normal forms 1-5 break down along two major lines: decompose functional dependencies (Boyce-Codd Normal Form), and decompose join dependencies (4th & 5th normal forms). In the example, there is clearly a loss of information from the wider model—the table definition Hans-Jürgen shows later on—so the split into product + sales tables does not adhere to 5th normal form.

The sales table does need some variant of product_unit_price because that was the price as of the time of sale, and prices can change for a variety of reasons: discounts because the product is on sale, adjustments due to inflation, increases in demand, supply chain issues, adjustments based on new marketing models, etc. For historical purposes, it is important that we capture the price as of the time of sale, as that will tell how much to refund if the customer returns with the product. It is also important for bookkeeping reasons, as a variety of taxes and reports depend upon that number.
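To see why dropping the sale-time price would be lossy, consider a hypothetical "normalized" variant of the sales table with no price column; the table and column names here follow the model shown later in this post, and the prices are invented for illustration:

```sql
-- Hypothetical: sales has been stripped of its price column, on the theory
-- that the price can be recovered by joining to product. After any price
-- change, a refund calculated this way is wrong:
SELECT s.sales_time,
       s.product_number,
       p.product_price AS refund_per_unit  -- current price, NOT the price paid
FROM sales AS s
JOIN product AS p USING (product_number);
-- If an item cost 9.99 at sale time and is 12.99 today, this refunds 12.99.
-- The decomposition lost information: the sale-time price is unrecoverable.
```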

But it’s also clear that the product table is reasonable: a product number (following the first rule of database architecture: anything labeled “number” is actually a string with non-numeric characters in it), a product name, and a product price. But this is the current sales price, which is different from the price at the time of sale. Those are two separate attributes and labeling them both “product_unit_price” leads to some of that confusion. They should both be in the data model.

Then, if I’m looking critically at the data model, I would ask why product name is in the sales data. Do product names change so frequently that we need to capture them on every sales record? Do managers have the authority to change the name of a product on the fly during a sale? Probably not, and unless managers do have that discretion or the product name changes on a per-sale basis, it doesn’t belong there. There could be a functional dependency from product_number to product_name, such that each unique value of product_number has one and only one product_name. In that case, the product table should have product_name and the sales table should not.
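If you already have denormalized sales data, you can check whether that functional dependency actually holds before moving the column. A hypothetical sketch, assuming a wide table here called sales_wide that carries both product_number and product_name:

```sql
-- Any rows returned are product numbers whose name varies across sales,
-- i.e. violations of the candidate dependency product_number -> product_name.
SELECT product_number,
       count(DISTINCT product_name) AS name_count
FROM sales_wide
GROUP BY product_number
HAVING count(DISTINCT product_name) > 1;
-- Zero rows back means the dependency holds in the data you have, and
-- product_name can safely live on the product table alone.
```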

But what if the product name does change over time and we need to track those changes? Then I’d (potentially) relabel product_name on the product table to current_product_name and create a 6th normal form table called something like product_name_history. This table would include the product_number, the product_name, and a time range over which that name was valid for that product. These timeframes should not overlap and, as long as the product exists, there should be an entry that corresponds with the appropriate name at any arbitrary time.

My final model would look something like:

CREATE TABLE product (
    product_number text PRIMARY KEY,
    product_name text NOT NULL,
    product_price numeric(10, 2) NOT NULL
);

CREATE TABLE product_name_history (
    -- Primary key would be on the combo of product_number and begin_time/interval
    product_number text NOT NULL,
    product_name text NOT NULL,
    -- Architected as an interval, implemented as start and end timestamps
    begin_time timestamptz NOT NULL,
    end_time timestamptz NOT NULL
);

CREATE TABLE sales (
    sales_time timestamptz NOT NULL,
    product_number text NOT NULL,
    sale_unit_price numeric(10, 2) NOT NULL,
    sale_units numeric(10, 2) NOT NULL
);

I also renamed “product_unit_price” and “product_units” to “sale_unit_price” and “sale_units” to make it clearer that these are associated with the sale itself. We could also get into things like sales orders versus sales order line items, additional attributes, etc., but this post is already long enough as it is. And again, this assumes that the product name changes frequently enough, and that the different names are relevant enough, for us to care about what the product name was at the time of a given sale. If not, we could remove the product_name_history table without otherwise changing the model.
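As an aside, PostgreSQL can enforce the no-overlap rule on product_name_history directly, using a range type and an exclusion constraint. This is a sketch of one way to do it under the schema above, not something from Hans-Jürgen’s post:

```sql
-- btree_gist is needed for the equality part of the exclusion constraint.
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE product_name_history
    ADD CONSTRAINT product_name_history_no_overlap
    EXCLUDE USING gist (
        product_number WITH =,
        tstzrange(begin_time, end_time) WITH &&
    );

-- Point-in-time lookup: the name a product carried at the moment of a sale.
SELECT s.sales_time, s.product_number, h.product_name
FROM sales AS s
JOIN product_name_history AS h
  ON h.product_number = s.product_number
 AND tstzrange(h.begin_time, h.end_time) @> s.sales_time;
```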

Long story short, I agree with Hans-Jürgen that database design is important and understanding the business domain is critical for designing a good data model. But where I disagree is with the concept of over-normalization. That’s not a real thing. Improper attempts at normalization are certainly a real thing, as we see in this post, and it is important to understand the business rules as well as the rules of defining normal forms in order to prevent this.


Stream or Batch Ordering with Apache Iceberg

Jack Vanlightly shows some tradeoffs:

Today I want to talk about stream analytics, batch analytics and Apache Iceberg. Stream and batch analytics work differently but both can be built on top of Iceberg, but due to their differences there can be a tug-of-war over the Iceberg table itself. In this post I am going to use two real-world systems, Apache Fluss (streaming tabular storage) and Confluent Tableflow (Kafka-to-Iceberg), as a case study for these tensions between stream and batch analytics.

Read on for a summary of how two opposite ideas can both be perfectly reasonable.


Azure Tenants and Microsoft Fabric

Andy Cutler begins a new series on Microsoft Fabric architecture:

Our Fabric Architecture journey starts with Azure Tenants (the kick-off blog in this series is here with a few jumping-off links to get started with thinking about Fabric Architecture). If you’re ready to spend time sketching out your Fabric Capacity planning, workspace strategy, domain topology, lakehouse/warehouse creation, data loading processes…you might want to stop for a minute and think about tenants. The question I’d like you to consider is: What do I need to know when working with a single or a multi-tenancy approach? Let’s unpack this question, because while it might sound like a simple list, it actually shapes your governance, scalability, and Fabric operational model. If you’re a seasoned Azure Architect, then you already know how to decide between single and multi-tenant cloud rollouts (also, please comment if you have anything to add). If you work with Fabric/Data and don’t really dive into Azure architecture on a daily basis, then please stick around. Hopefully this blog gets you thinking about single/multi-tenant architectures and the benefits/costs.

Read on for a dive into what tenants are, the benefits of single- versus multi-tenancy, and how it all ties into Fabric.


Star Schemas and Keys

Chris Barber provides a primer on the types of keys that are critical for a star schema:

Keys are a core component of star schema modelling; relationships between tables are built using the keys. This article covers:

  1. The main key types
  2. Star Schema diagrams
  3. Best practices when using Keys

An understanding of keys becomes increasingly important with more complex solutions. Not only do you need to understand them from a modelling perspective, but a common vernacular is required to communicate with team members.

It’s easier to think of the keys Chris describes in two separate classes rather than four unique items. Surrogate and natural keys are descriptors of a primary key (or any other unique/alternate key), after all.
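To illustrate that two-class framing, here is a hedged sketch of a dimension table (the names are invented for the example): the surrogate key serves as the primary key, while the natural key from the source system survives as a unique alternate key, and fact tables reference only the surrogate.

```sql
CREATE TABLE dim_product (
    product_key    int GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- surrogate key
    product_number varchar(20) NOT NULL UNIQUE,  -- natural key from the source system
    product_name   varchar(100) NOT NULL
);

-- Fact tables join on the surrogate key, not the natural key:
CREATE TABLE fact_sales (
    product_key int NOT NULL REFERENCES dim_product (product_key),
    sale_units  int NOT NULL
);
```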


Dealing with Many-to-Many Relationships in Power BI

Boniface Muchendu handles a many-to-many relationship:

Many-to-many relationships in Power BI are one of the most frequent challenges faced by new and intermediate users. These relationships can cause incorrect totals in visuals, confusing results, and slower report performance. In this guide, we’ll explore what many-to-many relationships in Power BI are, why they’re problematic, and how to fix them using the most effective methods available.

Click through for an enumeration of the problem as well as a couple of ways to resolve it.
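One of the standard fixes, which carries over from relational modeling into Power BI, is to resolve the many-to-many through a bridge table. A sketch with invented names, assuming customers and accounts can be joined many-to-many (e.g., joint accounts):

```sql
CREATE TABLE dim_customer (customer_id int PRIMARY KEY, customer_name text);
CREATE TABLE dim_account  (account_id  int PRIMARY KEY, account_name  text);

-- The bridge turns one many-to-many relationship into two one-to-many
-- relationships, which filter predictably and avoid double counting.
CREATE TABLE bridge_customer_account (
    customer_id int NOT NULL REFERENCES dim_customer (customer_id),
    account_id  int NOT NULL REFERENCES dim_account (account_id),
    PRIMARY KEY (customer_id, account_id)
);
```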


Using Workspace Folders in Microsoft Fabric

Jon Vöge walks through a few folder strategies in Microsoft Fabric:

Every time I see a new Fabric Data Platform, I see a new way of using folders. Almost.

Ranging from no folders at all, to using folders to segregate item types, to folders for bronze/silver/gold layers, to even seeing setups where they are used for DEV/TEST/PROD.

I won’t claim all of these to be equally good. But the point is that there are many different approaches you may take.

Read on for a few strategies, including ones Jon would recommend avoiding.


Tokenization in SQL Server

Sebastiao Pereira demonstrates a combination of encryption and redirection to store sensitive data:

As privacy regulations like the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI DSS), and more tighten, there is an increased focus on protecting sensitive information within databases. Tokenization is an option to adhere to those regulations. Let’s see how to implement SQL tokenization in SQL Server.

This is a reasonably clever solution, though if you need to search on any of the tokenized (i.e., encrypted and moved to a separate table) values, performance would be miserable. Even displaying the results for a moderately sized result set would run into serious performance issues. I suppose that if you, for some regulatory reason, need to keep these tokens stored separately from the data, you manage expectations the best you can.
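The general pattern amounts to something like the following T-SQL sketch (table and column names are invented, and the encryption itself is left abstract): the operational table stores only a token, and a separate vault table maps each token to the protected value.

```sql
-- Vault table, ideally in a separate, more tightly secured database.
CREATE TABLE dbo.TokenVault (
    Token uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWID(),
    EncryptedValue varbinary(256) NOT NULL  -- ciphertext of the sensitive value
);

-- The operational table stores the token in place of the sensitive column.
CREATE TABLE dbo.Customer (
    CustomerID int NOT NULL PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL,
    SSNToken uniqueidentifier NOT NULL
        REFERENCES dbo.TokenVault (Token)
);

-- Detokenizing requires a join (plus decryption) per row, which is why
-- searching or displaying many tokenized values performs so poorly.
SELECT c.CustomerID, v.EncryptedValue
FROM dbo.Customer AS c
JOIN dbo.TokenVault AS v ON v.Token = c.SSNToken;
```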
