Press "Enter" to skip to content

Curated SQL Posts

Estimating Probabilities from Unevenly Collected Data

Nina Zumel answers an important question:

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

  • The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.
  • The effectiveness of various ad campaigns, when some compaigns have had more exposure than others.
  • The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we’ll try to estimate the “innate” batting ability (the probability of making a hit when at bat) of major league baseball players in 2023. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

It’s an interesting problem because she’s looking at 2023 data as an estimation of the player’s entire career, with the goal of estimating how a player will perform overall given a fairly reasonably sized sample of information collected from one relatively short period of that player’s career. H/T John Mount.

Leave a Comment

A Challenge of Visualizing Game Statistics

Kieran Healy scratches his head:

I just finished driving a very long way up the side of the country, so I’m kind of tired. But even allowing for that, boy, this way of representing things really is quite confusing. Not being an Apple Sports user I had to look at it for a bit to understand what was happening. But, now that it has given me a headache, I can kind of see why whoever designed this ended up in the undoubtedly bad place they did.

Before I get to why I have some sympathy for the designer, why did I find this representation of these numbers so disorienting? It’s not just just because I’ve been driving for nine hours. John is right to call the picture a “Zero Sum” representation. The design strongly suggests to the viewer that, within each row, we’re looking at each team’s share of a total. Each pair of black and blue lines seem to be vying for control of their whole row, with the longest line being the “winner” in each case.

Click through for the challenge, as well as a trio of attempts to improve the results. The tornado chart at the end is probably what I’d go with if I needed to include all of this on a single chart. H/T R-Bloggers.

Leave a Comment

A Look at Tabular Foundation Models

Michael Mayer tries out a neural network model:

Tabular data has had a comfortable life for years. Gradient boosting showed up, got very good at its job, and then quietly became the default answer to almost everything with rows and columns.

In very recent years, a new player has arrived: the tabular foundation model or prior fitted neural network, and suddenly tabular data is sounding a lot less sleepy…

I’ve done a bit with TabPFN and come away fairly impressed. I’ll have to give this a go as well. There are definite limitations to data sizes before things fall over, but for moderate sizes (50k or fewer rows), TabPFN at least worked pretty well.

Leave a Comment

Dimensional Testing in Kafka

Jack Vanlightly announces a new tool:

Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka.

Click through for an overview of the tool and how it works.

Leave a Comment

ORDER BY COALESCE() in PostgreSQL (and SQL Server)

Laetitia Avrot digs in:

I was reading Markus Winand’s latest post on ORDER BY history last week. If you haven’t read it yet, go read it. Markus is one of the best writers on SQL standards, and this post is no exception.

One line stopped me cold. The compatibility table for “expressions on selected columns.” Postgres: partial. PostgreSQL 18: still partial.

That itch needed scratching.

The basic version of this is that you cannot use the alias of a computed expression in a function in the ORDER BY clause in either PostgreSQL or SQL Server. In other words, the following fails:

SELECT a + b AS x
FROM t
ORDER BY COALESCE(x, 0);

Read on for an explanation of why this is the case in PostgreSQL. I’d imagine that the reasoning is about the same for SQL Server.

Leave a Comment

Adding Foreign Keys and Deadlocks

Michael J. Swart explains a challenge in adding a foreign key to an existing table:

Schema modification locks (SCH-M) are taken by DDL (Data Definition Language) statements like CREATE/ALTER/DROP.
Schema stability locks (SCH-S) are taken by DML (Data Manipulation Language) statements like INSERT/UPDATE/DELETE.

Those two types of locks are incompatible. Meaning, I can’t get a SCH-S lock on some table if you’ve already got a SCH-M lock on it (and vice versa).
Paul Randal describes the SCH-M lock as a super-table-X lock. It makes sense to me, if I’m half way through querying a table, I don’t want its definition to change.

Such a pessimistic lock can be awkward for a busy system. The SCH-M can cause a lot of blocking. For example, creating (and dropping) foreign keys requires a SCH-M lock not only on the parent table, but also on the referenced table which leads to trouble.

Click through for a demonstration of the problem. Michael also has some guidance on how to minimize the issue. I’d note the degenerative form of this guidance: understand your data model up-front and apply foreign key constraints at table creation time. That’s not always possible, sure, so when you can’t do that, Michael has some good advice.

Leave a Comment

Explaining the Fabric Ontology

James Serra takes us through a big word:

For years, most data conversations have started with tables. We ask where the data lives, what columns are available, how the joins work, and whether the data is in a warehouse, lakehouse, semantic model, or some other system. That makes sense, because tables are how most of us have worked with data for decades. But tables are not how the business thinks.

A business thinks in terms of customers, products, orders, shipments, assets, flights, runways, employees, policies, and actions. The problem is not usually a lack of data. The problem is a lack of shared meaning. Organizations often have the same business concept represented multiple ways across teams and systems, creating what I would call semantic drift. Sales may define a customer one way. Finance may define it another way. Operations may have yet another version in a different system with different keys, names, and assumptions. That is exactly where Fabric Ontology becomes important. It is designed to close the gap between physical data structures and business meaning.

Microsoft is a bit late to the ontology game and their current concept of an ontology shows. I can understand where they’re going but they still have a ways to go.

Leave a Comment

Working with PIVOT and UNPIVOT in SQL Server

Ed Pollack explains a pair of operators:

There are few operators in T-SQL that cause developers to scramble for documentation more than PIVOT and UNPIVOT. Beyond documentation, transforming columns into rows (and vice-versa) can often be confusing and frustrating for those of us tasked with reformatting data for use by an application.

This article walks through PIVOT and UNPIVOT, providing examples of simple use cases for both – as well as some more complex scenarios we can run into in real-world data. These can be extraordinarily useful ways to reformat data efficiently and quickly with less code than the alternatives. So, there is no need to fear them again!

Click through for Ed’s article. I definitely don’t fear either PIVOT or UNPIVOT and they can be quite useful. But if you locked me in a room and I couldn’t leave until I came up with the proper syntax for both from memory, well, I’d be in that room for a while.

Leave a Comment

The Pain of NULL

Louis Davidson explains the unknown:

There is no simpler topic in relational comparisons than three valued logic. I am being mostly facetious about this, but in reality, it seems so simple that people don’t think about how a NULL works, and make mistakes all of the time. I was reading a post about this the other day on LinkedIn (which by no means could one ever find again!) where one of the comments chastised the author of the post for not understanding “the fundamentals” of relational theory. The original poster wasn’t completely right (and my post may not be completely perfect either, though I will back most of what I write with code.)

In this post I want to point out a few of the key basics that one really should understand.

Click through for a primer on what NULL means and doesn’t mean. And by the time you’re done, I’d like to interest you in the power of 6th normal form, where you can effectively banish NULL into the abyss (at least until you join the bits back together).

Leave a Comment

Finding Bad Queries with sp_QuickieCache

Erik Darling is on a hunt for bad queries:

That’s the funny part. Alright. Cool. With that out of the way, let’s look at this new store[d] procedure. Uh, I think I have to go to Management Studio. Yeah, I remember what that looks like. Alright. Cool. So, uh, this is, this is it. SPQuickieCache. Pay no attention to the terrible red squiggly underlines.

I think Erik made his AI business partner angry, as it didn’t strip out any of the filler words from the transcript. But this does look like a neat stored procedure.

1 Comment