
Day: May 22, 2026

Estimating Probabilities from Unevenly Collected Data

Nina Zumel answers an important question:

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

  • The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.
  • The effectiveness of various ad campaigns, when some campaigns have had more exposure than others.
  • The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we’ll try to estimate the “innate” batting ability (the probability of making a hit when at bat) of major league baseball players in 2023. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

It’s an interesting problem because she treats the 2023 season as the only evidence of each player’s ability, with the goal of estimating how a player will perform overall from a moderately sized sample collected over one relatively short period of that player’s career. H/T John Mount.
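
To make the uneven-sample problem concrete, here is a minimal sketch of one common approach: shrinking each player's raw hit rate toward the pooled rate, with more shrinkage for players who have fewer at-bats. This is not necessarily the method Nina uses in the article. The player records are made up for illustration, and `PRIOR_STRENGTH` (the number of "pseudo at-bats" given to the pooled rate) is an assumed tuning value.

```python
# Hypothetical (hits, at_bats) records; these are NOT real 2023 statistics.
players = {
    "A": (2, 4),
    "B": (30, 120),
    "C": (150, 550),
    "D": (90, 400),
    "E": (5, 30),
}

# Assumed prior weight: how many pseudo at-bats the pooled rate is worth.
PRIOR_STRENGTH = 100

total_hits = sum(hits for hits, _ in players.values())
total_at_bats = sum(at_bats for _, at_bats in players.values())
pooled_rate = total_hits / total_at_bats


def raw_rate(hits, at_bats):
    """Observed hit rate with no adjustment."""
    return hits / at_bats


def shrunk_rate(hits, at_bats, prior_mean=pooled_rate, strength=PRIOR_STRENGTH):
    """Weighted blend of the observed rate and the prior mean.

    Equivalent to the posterior mean of a Beta prior with
    alpha = strength * prior_mean and beta = strength * (1 - prior_mean).
    """
    return (hits + strength * prior_mean) / (at_bats + strength)


for name, (hits, at_bats) in players.items():
    print(name, at_bats, round(raw_rate(hits, at_bats), 3),
          round(shrunk_rate(hits, at_bats), 3))
```

With these made-up numbers, player A (4 at-bats) is pulled most of the way toward the pooled rate, while player C (550 at-bats) barely moves. That is the behavior you want when comparing subjects with very different sample sizes.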


A Challenge of Visualizing Game Statistics

Kieran Healy scratches his head:

I just finished driving a very long way up the side of the country, so I’m kind of tired. But even allowing for that, boy, this way of representing things really is quite confusing. Not being an Apple Sports user I had to look at it for a bit to understand what was happening. But, now that it has given me a headache, I can kind of see why whoever designed this ended up in the undoubtedly bad place they did.

Before I get to why I have some sympathy for the designer, why did I find this representation of these numbers so disorienting? It’s not just because I’ve been driving for nine hours. John is right to call the picture a “Zero Sum” representation. The design strongly suggests to the viewer that, within each row, we’re looking at each team’s share of a total. Each pair of black and blue lines seems to be vying for control of their whole row, with the longest line being the “winner” in each case.

Click through for the challenge, as well as a trio of attempts to improve the results. The tornado chart at the end is probably what I’d go with if I needed to include all of this on a single chart. H/T R-Bloggers.


A Look at Tabular Foundation Models

Michael Mayer tries out a neural network model:

Tabular data has had a comfortable life for years. Gradient boosting showed up, got very good at its job, and then quietly became the default answer to almost everything with rows and columns.

In very recent years, a new player has arrived: the tabular foundation model or prior fitted neural network, and suddenly tabular data is sounding a lot less sleepy…

I’ve done a bit with TabPFN and come away fairly impressed. I’ll have to give this a go as well. There are definite limitations to data sizes before things fall over, but for moderate sizes (50k or fewer rows), TabPFN at least worked pretty well.


Dimensional Testing in Kafka

Jack Vanlightly announces a new tool:

Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past for RabbitMQ and Apache Pulsar, but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas, I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka.

Click through for an overview of the tool and how it works.


ORDER BY COALESCE() in PostgreSQL (and SQL Server)

Laetitia Avrot digs in:

I was reading Markus Winand’s latest post on ORDER BY history last week. If you haven’t read it yet, go read it. Markus is one of the best writers on SQL standards, and this post is no exception.

One line stopped me cold. The compatibility table for “expressions on selected columns.” Postgres: partial. PostgreSQL 18: still partial.

That itch needed scratching.

The basic version of this is that you cannot use a SELECT-list alias inside a function call (or any larger expression) in the ORDER BY clause in either PostgreSQL or SQL Server. Sorting by the bare alias works, but wrapping it does not. In other words, the following fails:

SELECT a + b AS x
FROM t
ORDER BY COALESCE(x, 0);

Read on for an explanation of why this is the case in PostgreSQL. I’d imagine that the reasoning is about the same for SQL Server.
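
For what it’s worth, here is a minimal sketch of the two rewrites I would reach for: repeat the underlying expression in the ORDER BY, or compute the alias in a derived table and sort in the outer query. These are my own suggestions rather than something taken from Laetitia’s post. The snippet runs against an in-memory SQLite database purely so that it is self-contained; SQLite’s alias resolution rules may differ from PostgreSQL’s, so it shows that the rewrites produce the same ordering, not that the original query fails. The table `t` and its rows are made up.

```python
import sqlite3

# In-memory SQLite database, used only so the example runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(1, 2), (None, 5), (4, 3), (0, 1), (-3, 1)],  # hypothetical rows
)

# Rewrite 1: repeat the expression instead of referencing the alias.
repeat_expression = """
SELECT a + b AS x
FROM t
ORDER BY COALESCE(a + b, 0)
"""

# Rewrite 2: compute the alias in a derived table, then sort outside it,
# where x is an ordinary input column rather than an output alias.
derived_table = """
SELECT x
FROM (SELECT a + b AS x FROM t) AS s
ORDER BY COALESCE(x, 0)
"""

rows_repeat = [row[0] for row in conn.execute(repeat_expression)]
rows_derived = [row[0] for row in conn.execute(derived_table)]

print(rows_repeat)
print(rows_derived)
conn.close()
```

The derived-table form avoids duplicating the expression, which matters more as the expression gets longer. Repeating the expression is fine for something as short as a + b.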
