Category: Data Science

Estimating Probabilities from Unevenly Collected Data

Nina Zumel answers an important question:

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

  • The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.
  • The effectiveness of various ad campaigns, when some campaigns have had more exposure than others.
  • The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we’ll try to estimate the “innate” batting ability (the probability of making a hit when at bat) of major league baseball players in 2023. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

It’s an interesting problem because she’s treating 2023 data as a stand-in for the player’s entire career, with the goal of estimating how a player will perform overall given a reasonably sized sample collected from one relatively short period of that career. H/T John Mount.


A Look at Tabular Foundation Models

Michael Mayer tries out a neural network model:

Tabular data has had a comfortable life for years. Gradient boosting showed up, got very good at its job, and then quietly became the default answer to almost everything with rows and columns.

In very recent years, a new player has arrived: the tabular foundation model or prior fitted neural network, and suddenly tabular data is sounding a lot less sleepy…

I’ve done a bit with TabPFN and come away fairly impressed. I’ll have to give this a go as well. There are definite limitations to data sizes before things fall over, but for moderate sizes (50k or fewer rows), TabPFN at least worked pretty well.


Probabilistic Time Series Cross-Validation in R

Thierry Moudiki checks an interval:

A previous post introduced the crossvalidation package for R. This time, the focus is on probabilistic forecasting — evaluating not just how accurate point forecasts are, but how well-calibrated prediction intervals are, using empirical coverage rates and Winkler scores – and crossvalidation.

Click through for the code and not much additional commentary. H/T R-Bloggers.
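To make the two evaluation measures concrete, here is a minimal sketch in Python rather than R. It does not use the crossvalidation package, and the function names are my own, hypothetical ones. Empirical coverage is the share of actuals that land inside their prediction interval. The Winkler score is the interval width plus a penalty of 2/alpha times the miss distance whenever the actual falls outside, so lower is better.

```python
def empirical_coverage(actuals, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return hits / len(actuals)


def winkler_score(actuals, lower, upper, alpha):
    """Mean Winkler score for (1 - alpha) prediction intervals (lower is better)."""
    total = 0.0
    for y, lo, hi in zip(actuals, lower, upper):
        score = hi - lo  # start from the interval width
        if y < lo:
            score += (2.0 / alpha) * (lo - y)  # penalty for missing below
        elif y > hi:
            score += (2.0 / alpha) * (y - hi)  # penalty for missing above
        total += score
    return total / len(actuals)


# Made-up numbers purely for illustration: 80% intervals, so alpha = 0.2
actuals = [10, 12, 15, 9]
lower = [8, 11, 10, 10]
upper = [12, 13, 14, 12]

coverage = empirical_coverage(actuals, lower, upper)
score = winkler_score(actuals, lower, upper, alpha=0.2)
```

A well-calibrated 80% interval should show coverage near 0.8 over many forecasts. The Winkler score adds the other half of the story by rewarding intervals that are narrow without missing too often.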


Scoring the Quality of Binary Classification with SQL Server

Sebastiao Pereira quantifies a result:

Machine Learning (ML) is a way of teaching computers to learn from data instead of being explicitly programmed. Performance metrics are essential tools for understanding how well a model actually works. They tell you not just how accurate the model is, but how reliable, fair, and useful it will be in real-world applications. In other words, without them, machine learning would be trial-and-error guesswork.

Binary classification is when each sample is labeled as one of two mutually exclusive classes, referenced to a categorization, like positive or negative.

How do you implement the binary classification performance metric in SQL Server without using external tools?

Click through for a series of metrics to determine how well a binary classification process performed. This post doesn’t include details on how to perform the classification, just what to do once you have the results.
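For a sense of what these metrics look like once you have the results, here is a minimal Python sketch (not the T-SQL from the linked post) that derives four common ones from confusion matrix counts. The function name and the sample counts are hypothetical, and the linked post may cover a different or larger set of metrics.

```python
def binary_metrics(tp, fp, fn, tn):
    """Common binary classification metrics from confusion matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Undefined ratios (zero denominators) return 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts purely for illustration
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

Returning 0.0 for an undefined ratio is a design choice for this sketch. Depending on your needs, returning None (or NULL in SQL) may be the more honest answer.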


When R^2 Misleads

Holger von Jouanne-Diedrich explains a common quality metric for regression analyses:

A high R^2 can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high R^2 is not always a sign of a good model, read on!

Click through for that explanation. This post does a fantastic job of explaining the technical reasons why a high R^2 might not be indicative of a good model specification. But I’d add one other piece to the puzzle: what constitutes a high R^2 will depend very much on the domain. For example, if you are performing a regression of some process in physics, an R^2 of 0.90 is probably low enough to indicate you’ve made a horrible mistake somewhere.

By contrast, an R^2 of 0.90 in the context of a social studies analysis would get you laughed out of the room for obviously faking the data or misunderstanding the specification to get a number that high.
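As a small illustration of how a high R^2 can coexist with a wrong specification, here is a Python sketch of my own (not an example taken from the linked post). It fits a straight line by ordinary least squares to data generated from y = x^2 with no noise at all. The helper name is hypothetical.

```python
def linear_fit_r2(xs, ys):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# The true relationship is purely quadratic, so a straight line is the wrong model
xs = list(range(11))        # 0, 1, ..., 10
ys = [x ** 2 for x in xs]

slope, intercept, r2 = linear_fit_r2(xs, ys)
```

The fitted line comes out as y = -15 + 10x with an R^2 of about 0.93, even though the residuals follow an obvious curve. The number alone gives no hint that the functional form is wrong.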


Training, Serving, and Deploying Scikit-Learn Models via FastAPI

Abid Ali Awan serves a model:

In this article, you will learn how to train a Scikit-learn classification model, serve it with FastAPI, and deploy it to FastAPI Cloud.

Topics we will cover include:

  • How to structure a simple project and train a Scikit-learn model for inference.
  • How to build and test a FastAPI inference API locally.
  • How to deploy the API to FastAPI Cloud and prepare it for more production-ready usage.

Click through for the process.


Cross-Workspace MLflow Logging Available in Microsoft Fabric

Ruixin Xu announces a feature now generally available:

Cross-workspace logging works through the synapseml-mlflow package, which provides a Fabric-compatible MLflow tracking plugin. The core idea is simple: set the MLFLOW_TRACKING_URI to point at your target workspace and use standard MLflow commands. Your experiments, metrics, parameters, and registered models land in the workspace you choose, not just the one you’re running in.

Read on for the full announcement.


Implementing SOFTMAX in SQL Server

Sebastiao Pereira is back with another formula:

The SOFTMAX function takes raw scores and converts them into a probability distribution. This mathematical function is used in neural network training, multiclass classification methods, multinomial logistic regression, multiclass linear discriminant analysis, and naïve Bayes classifiers. How can this function be built in SQL Server?

Click through for the implementation.
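For reference, the function itself is short. Here is a minimal Python sketch (not the T-SQL from the linked post) with a hypothetical function name. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    max_score = max(scores)
    # Shifting by the max keeps exp() from overflowing on large inputs
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores purely for illustration
probabilities = softmax([2.0, 1.0, 0.1])
```

Larger scores get larger probabilities, and the gap between them grows exponentially with the gap between the scores.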


Updating a Mean without Original Data Points

John Cook has an interesting solution:

This post will look at the problem of updating an average grade as a very simple special case of Bayesian statistics and of Kalman filtering.

Suppose you’re keeping up with your average grade in a class, and you know your average after n tests, all weighted equally.
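With equal weights, the arithmetic behind this setup is short: the new average after test n + 1 is (n * mean + x) / (n + 1), which rearranges to mean + (x - mean) / (n + 1). Here is a minimal Python sketch of that update, with a hypothetical function name. It is only the plain arithmetic, not the Bayesian or Kalman framing from the post.

```python
def update_mean(mean, n, new_value):
    """Update an equally weighted mean of n values with one new value.

    Only the current mean and the count are needed, not the original data.
    """
    return mean + (new_value - mean) / (n + 1)


# Made-up grades purely for illustration: an 80 average over 4 tests, then a 90
new_mean = update_mean(mean=80.0, n=4, new_value=90.0)
```

The correction term shrinks as n grows, so each new test moves the average less than the one before it.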

Click through for the walkthrough. This is similar to something I tried to puzzle out but ultimately admitted defeat: is there a way to calculate updates to the median without needing to know the entire set? In practical terms, this would be something like, how many pieces of information do I need to guarantee that I can maintain a median over time?

The best I could come up with was built along the premise of the likelihood of new data points being less than the median versus those greater than the median, where each pair of greater-lesser cancel each other out. If you have roughly equal numbers of new data points to each side, your “elements of the median” array can be pretty small. But the problem is, for any sufficiently small k, where k represents the number of elements you keep in memory, it is possible for a localized collection of (without loss of generality) lower-than-median data points to come in and completely wash out your memory. For example, if you kept 3 points in memory and four values below the median arrive, you no longer know what the median is.

Trying to solve this without knowing the shape of the distribution or making any sequencing assumptions is something that I failed to do.


Implementing Shamir’s Secret Sharing in SQL Server

Sebastiao Pereira implements an algorithm:

Shamir’s Secret Sharing is a cryptographic algorithm that allows a secret to be split into multiple components and shared among a group in such a way that the secret can only be revealed if a minimum number of components are combined. Is it possible to have this algorithm implemented in SQL Server without using external tools?

Click through for a T-SQL implementation, as well as one using CLR.
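To show the idea behind the algorithm, here is a minimal Python sketch (not the T-SQL or CLR code from the linked post). The secret becomes the constant term of a random polynomial of degree threshold - 1 over a prime field, each share is a point on that polynomial, and Lagrange interpolation at x = 0 recovers the secret. The function names and the choice of prime are my own assumptions, and this is an illustration rather than production-grade cryptography.

```python
import secrets

# Assumption for this sketch: the Mersenne prime 2^127 - 1 as the field modulus
PRIME = 2**127 - 1


def split_secret(secret, threshold, num_shares, prime=PRIME):
    """Split an integer secret into num_shares points; any threshold recover it."""
    if not 0 <= secret < prime:
        raise ValueError("secret must be in [0, prime)")
    if not 1 <= threshold <= num_shares:
        raise ValueError("need 1 <= threshold <= num_shares")
    # Random polynomial whose constant term is the secret
    coeffs = [secret] + [secrets.randbelow(prime) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, modulo prime
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+)
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


# Made-up secret purely for illustration: 5 shares, any 3 recover it
shares = split_secret(123456789, threshold=3, num_shares=5)
recovered = recover_secret(shares[:3])
```

Any three of the five shares reconstruct the same value. Fewer than three leave the constant term undetermined, which is what gives the scheme its threshold property.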
