Category: Architecture

The Pain of Views in PostgreSQL

Radim Marik digs into views:

VIEWs should be the cleanest abstraction SQL, and therefore Postgres, has on offer. I love the concept. The promise of decoupling logical intent from physical storage is perfect on paper. In practice, few things in the database world trigger such a heated debate or carry as much historical baggage. VIEWs mix big promises with false hopes, and the promises rarely survive contact with production.

The appeal is straightforward. Abstract “active customer” once and reuse it everywhere. Every query, report and dashboard uses the same definition. The “active customer” then becomes the foundation of a “customer orders” view, which in turn powers an operational “customer summary” view.

Some of these pain points are PostgreSQL-specific, but others (especially around the performance of nested views) resonate with SQL Server as well.
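
To make the nesting concrete, here is a rough sketch of the layering Radim describes. The schema is hypothetical, my own invention rather than anything from his post:

    -- Each view builds on the one before it.
    CREATE VIEW active_customer AS
    SELECT customer_id, customer_name
    FROM customer
    WHERE status = 'active';

    CREATE VIEW customer_orders AS
    SELECT ac.customer_id, ac.customer_name, o.order_id, o.order_total
    FROM active_customer ac
    JOIN orders o ON o.customer_id = ac.customer_id;

    CREATE VIEW customer_summary AS
    SELECT customer_id, customer_name,
           COUNT(order_id) AS order_count,
           SUM(order_total) AS lifetime_value
    FROM customer_orders
    GROUP BY customer_id, customer_name;

Each layer looks tidy on its own; the trouble starts when the engine has to expand all three definitions for every query against customer_summary.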

Cold Storage Data Archival

Brent Ozar answers a question:

At least once a month, I get this question from a client:

We have big data, and we want to save money. We want to move the older data down to some kind of archives that will cost us less.

I ask exactly one followup question:

Are you willing to modify the app that queries the data, or do you want the users to still be able to query the old data in their existing screens and reports?

Brent recommends trying a linked server approach. I’d also toss PolyBase into the mix, though it’s going to have similar pros and cons to linked servers in this specific scenario. I have an old post on cold storage with PolyBase, but the mechanisms haven’t changed much. I do have some sample scripts in my PolyBase talk’s GitHub repo, though fair warning that these are going away soon as I rebuild the talk for 2025.
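
A minimal sketch of what the “keep the existing screens working” option can look like, with hypothetical object names. The cold side here is a linked server, per Brent’s recommendation, though a PolyBase external table would slot into the same pattern:

    -- Hot data stays local; cold data lives on the archive server.
    -- Existing reports query the view and never know the difference.
    CREATE VIEW dbo.Orders_All
    AS
    SELECT OrderID, CustomerID, OrderDate, OrderTotal
    FROM dbo.Orders                          -- recent, local data
    UNION ALL
    SELECT OrderID, CustomerID, OrderDate, OrderTotal
    FROM ColdArchive.ArchiveDB.dbo.Orders;   -- linked server to cheaper storage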

Defining the Fabric Ontology

Mike Donnelly explains a term:

The short version: Fabric Ontology is the semantic backbone of Microsoft Fabric. It’s the layer that defines what your business data actually means.

If you’ve ever worked in a large organization, you know the problem. One team calls them “Customers,” another calls them “Clients,” and a third calls them “Account Holders.” Without a shared meaning, your analytics become a mess of conflicting vocabularies. An ontology is just a structured way of naming things and describing their relationships so everyone—and every tool—is using the same dictionary.

I think this is all correct, but there’s something more to ontologies than that. At least in the Palantir world, the ontology is not just the business definitions and concepts but also the actions you can take against that data. In other words, you might have Customers, Clients, and Account Holders. You can add a new customer, update the customer’s details, send a welcome message to a new account holder, etc. Each of these actions is baked into the ontology as well. That’s what makes it different from simply defining where the data lives and how we got it into its current shape.

The Power of Database Projects

Andy Brownsword is sold:

If you’ve been using Database Projects for simply maintaining copies of your database objects – like I used to – then you’ve been missing out on the power of its deployments. Initially I was sceptical about how it could effectively perform upgrades but after seeing them land in SSMS last month I wanted to revisit them as a means of actual deployment.

My scepticism was completely misplaced, so if you haven’t touched Database Projects before – or had similar concerns as me – I want to demonstrate 3 features which make them not just good, but amazing for deployments.

Click through for those features. I will say that for straightforward databases, the database project deployment process is pretty good. Where it falls apart is when you have a large number of cross-database dependencies, especially if there are mutual cross-database dependencies: DB1.dbo.sp1 needs DB2.dbo.Table2, and DB2.dbo.view2 references DB1.dbo.Table1. In that case, the workaround is so annoying: it essentially comes down to “have three separate database projects, one for DB1, one for DB2, and one for a scaled-down version of DB1/DB2 without the dependencies, and then use that to inject into the other DB.” Which does kind of work, yeah, but now you’re maintaining even more. And once you get to dozens of dependencies and lots of cross-database queries? Yeah, forget about it.
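
For reference, this is roughly what a cross-database call looks like inside a database project, assuming you’ve added a database reference to the DB2 project with a SQLCMD variable named DB2 (all object names here are hypothetical):

    -- In the DB1 project; [$(DB2)] resolves at deployment time.
    CREATE PROCEDURE dbo.sp1
    AS
    BEGIN
        SELECT t2.SomeColumn              -- hypothetical column
        FROM [$(DB2)].dbo.Table2 AS t2;
    END;

This works fine one-way; it’s when DB2 needs a reference back to DB1 that you end up in the three-project dance described above.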

Don’t Use FLOAT for Currency Data

Vlad Drumea has a public service announcement:

In this post I provide some examples as to why FLOAT isn’t the best option for storing exact financial data in SQL Server.

During the last couple of years I’ve seen quite a few cases of FLOAT being used to store financial data in SQL Server.
This isn’t ideal due to how the FLOAT data type behaves in SQL Server, and the official documentation makes a clear note of that.
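
A quick repro of the kind of drift Vlad is describing; this is my own minimal example, not taken from his post:

    DECLARE @total FLOAT = 0;
    DECLARE @i INT = 0;

    -- Add one dime, ten times. 0.1 has no exact binary representation.
    WHILE @i < 10
    BEGIN
        SET @total += 0.1;
        SET @i += 1;
    END;

    SELECT
        @total AS SumOfTenTenths,                             -- not exactly 1.0
        IIF(@total = 1.0, 'equal', 'not equal') AS VersusOne; -- 'not equal'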

I’m crotchety enough about this to say that you shouldn’t use any sort of decimal data type (DECIMAL, MONEY, etc.) when dealing with currency data in important financial contexts, either. I reach for BIGINT and divide by 100 (or 10,000 if you keep to four spots after the decimal) for the presentation layer.
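
A minimal sketch of that approach, with a hypothetical table and two decimal places:

    -- Amounts stored as integer cents: exact arithmetic, no drift.
    CREATE TABLE dbo.LedgerEntry
    (
        LedgerEntryID BIGINT IDENTITY(1,1) PRIMARY KEY,
        AmountCents   BIGINT NOT NULL     -- $12.34 stored as 1234
    );

    INSERT INTO dbo.LedgerEntry (AmountCents)
    VALUES (1234), (-567);

    -- Divide only at the presentation layer.
    SELECT SUM(AmountCents) / 100.0 AS TotalDollars   -- 6.67
    FROM dbo.LedgerEntry;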

For cases in which my data isn’t the source system of record for financial transactions, DECIMAL is fine. I don’t generally like MONEY as a data type in any circumstance, to be honest.

Defining a Data Contract

Buck Woody becomes accountable:

A businessperson pulls a report from a data warehouse, runs the same query they’ve used for two years, and gets a number that doesn’t match what the finance team presented at yesterday’s board meeting. Nobody changed the report. Nobody changed the dashboard. But somewhere upstream, an engineering team renamed a field, shifted a column type, or quietly altered the logic in a pipeline, and nobody thought to mention it because there was no mechanism to mention it.

While we think of this as an engineering failure, it’s more of an implied contract failure. More precisely, it’s the absence of a formal contract. Data contracts are one of the most practical tools a data organization can adopt, and one of the most underused. The idea is not complicated: a data contract is a formal, enforceable agreement between the team that produces data and the team that consumes it. It defines what the data looks like, what quality standards it must meet, who owns it, and what happens when something changes. Think of it as the API layer for your data, the same guarantee a software engineer expects from a well-documented endpoint, applied to the datasets and pipelines your business depends on. This post is about why that matters at the CDO level and how to get them put in place.

Click through to learn more about what data contracts are and why they are useful. This post stays at the architectural level rather than the practitioner level, but lays out why it’s important to think about these sorts of things.

The Death of SQL Server, Yet Again

Rebecca Lewis states that rumors of SQL Server’s demise have been greatly exaggerated:

LinkedIn is doing that thing again. ‘Is SQL Server Dead?’ posts are getting traction, the comment sections are lit, and I’m sure a lot of managers out there are forwarding one of them to a DBA with a question mark and no context. It happens every few years. NoSQL was going to kill it. The cloud was going to kill it. Now it’s AI and Fabric. I’ve been doing this for 28 years. Let me save you some time.

No. SQL Server is not dead or even dying. But something is changing — and we need to understand what.

Read on for Rebecca’s take. At the very least, I strongly agree with the general thrust of her argument. And, of course, I would be remiss if I didn’t mention Feasel’s Law here.

What makes this tricky is that there’s definitely more impetus from Microsoft to get people into Azure and using SQL Server-derived products up there. And the stability of the product has definitely taken a hit in the past couple of versions, something I chalk up to the organization removing QA teams back in about 2018. I do think the biggest competition for SQL Server is open-source relational databases, particularly PostgreSQL. At the same time, it’s not just a matter of “Let’s take a weekend and replace database management systems.” For that reason (among others), I do expect to see SQL Server around for quite a while.

And do check out Andy Yun’s comment. It links to a paper a Shop Talk viewer gave me while I was ranting on a somewhat similar topic; it’s been on my to-read list ever since.

Goodhart’s Law and KPIs

Alexander Arvidsson talks metrics:

In 1974, the Soviet Union set production quotas for nails measured in tons.

Factories responded rationally. They produced a small number of enormous, completely useless nails. Problem solved. Quota met.

Moscow reconsidered and switched to quotas measured in number of nails.

Factories responded rationally. They produced millions of tiny, flimsy nails—too small to hold anything together.

The Soviet nail industry was, by any measurable standard, thriving. It was also producing almost nothing of value.

Click through for a detailed description of the problem and Alexander’s push for an answer. I typically summarize Goodhart’s Law in my own words as: any system can be gamed. I think Alexander’s tips on good measurement architecture are sound (and he even mentions gaming systems here), but even with all of those in place, you can still end up with a monkey’s paw of a KPI suite. Even an example like “ensure that users who log in are getting measurable value from the product, as evidenced by X” falls apart in the end, because we now maximize for X while hollowing out Y and Z, as nobody’s watching those. Take, say, “ensure that users who log in are getting measurable value from the product, as evidenced by average order value.” Now we maximize for AOV, streamlining the experience for people making large orders and relentlessly pushing more cart items at the user. The other points aim to mitigate the worst of these results, but it’s an off-the-cuff example of how no system is perfect. But some can still be superior.

Page-Level Caching in PostgreSQL with PAX

Laetitia Avrot has a two-parter on PAX. The first part is from a couple of months ago and sets the stage:

Picture this: you walk into a library. Each book is a database record. PostgreSQL’s traditional storage (NSM – N-ary Storage Model) stores complete books on each shelf: chapter 1, chapter 2, chapter 3, all bound together.

Here’s the problem that keeps me up at night: when you need only chapter 3 from 1,000 books, you must pull each complete book off the shelf, flip through chapters 1 and 2 that you’ll never read, grab chapter 3, and move to the next book.

You’re wasting time. You’re wasting energy. You’re wasting cache bandwidth.

But it’s not all roses:

PAX looks elegant on paper: minipages, cache locality, column-oriented access inside an 8KB page. But the moment you ask how this actually would work with Postgres, the complexity arrives fast. NULL values, variable-length types, MVCC, boundary shifts. Let’s go through it all.

To be clear, this is not a product (today). It’s an exercise in thinking through how to apply the results of an academic paper to Postgres, and all of the tricky challenges that can come from what seems to be an easy idea.
