Data Science – Curated SQL

When Decision Trees Fail

Published 2025-11-24 by Kevin Feasel

Ivan Palomares Carrascosa builds an explanation:

In this article, you will learn why decision trees sometimes fail in practice and how to correct the most common issues with simple, effective techniques.

Topics we will cover include:

How to spot and reduce overfitting in decision trees.

How to recognize and fix underfitting by tuning model capacity.

How noisy or redundant features mislead trees and how feature selection helps.

Read on for some of the perils of CART and some ways to resolve them.

Four Measures for Vector Search Quality

Published 2025-11-18 by Kevin Feasel

Joe Sack explains four important measures:

You type “3-bedroom townhouse near a good school” into a home search site. It shows 10 homes. Some perfect, some okay, some wrong. How do you know if it’s working?

Four numbers help with this: Precision (what proportion is relevant), Recall (what you missed), MRR (how far to the first relevant result), nDCG (best stuff first).

Read on to learn what each one means and how it applies to vector search.

Pulling Random Values from a Gaussian Distribution in T-SQL

Published 2025-11-10 by Kevin Feasel

Sebastiao Pereira has another way of populating a random variable:

Generating random numbers from a normal distribution is essential for accuracy and realistic modeling. Used for simulation, inference, cryptography, and algorithm design for scientific, engineering, statistical, and AI domains. Is it possible to create random Gaussian numbers in SQL Server using the Ziggurat algorithm without external tools?

I was not familiar with this technique, so it’s neat to see it in action.

Calculating Exponential Moving Average in T-SQL

Published 2025-11-07 by Kevin Feasel

Rick Dobson watches the flow:

Exponential moving averages (emas) are a powerful means of detecting changes in time series data. However, if you are new to this task, you may be wondering how to choose from conflicting advice about how to calculate emas. This tip reviews several of the most popular methods for calculating moving averages. Additionally, this tip presents T-SQL code samples with common table expressions and stored procedures for generating emas from an underlying time series dataset.

“Emas don’t just track trends—they reveal momentum in motion.” That’s why they’re favored when recent values matter most—and why this tip focuses on helping you calculate them with precision.

Read on for the formula and a couple of lengthy scripts to generate it.

Comments closed

Random Number Generation in T-SQL via Marsaglia Polar Method

Published 2025-10-31 by Kevin Feasel

Sebastiao Pereira implements a method for generating random numbers in T-SQL:

Generating random numbers from a normal distribution is essential for accuracy and realistic modeling, simulation, inference, and algorithm design for scientific, engineering, statistical, and AI domains. How can we build a random number generator using Marsaglia Polar method in SQL Server without the use of external tools?

It’s an interesting technique that works well for drawing points from a two-dimensional space around a point.

Comments closed

LightSHAP

Published 2025-10-20 by Kevin Feasel

Michael Mayer announces a new Python package:

LightSHAP is here – a new, lightweight SHAP implementation for tabular data. While heavily inspired from the famous shap package, it has no dependency on it. LightSHAP simplifies working with dataframes (pandas, polars) and categorical data.

Read on to see how it works. Version 0.1.12 is the current version as of this post and it’s available via PyPi.

Comments closed

Generative Additive Models for Customer Lifetime Value Estimation

Published 2025-10-15 by Kevin Feasel

Nicholas Clark builds a GAM:

I typically work in quantitative ecology and molecular epidemiology, where we use statistical models to predict species distributions or disease transmission patterns. Recently though, I had an interesting conversation with a data science PhD student who mentioned they were applying GAMs to predict Customer Lifetime Value at a SaaS startup. This caught my attention because CLV prediction, as it turns out, faces remarkably similar statistical challenges to ecological forecasting: nonlinear relationships that saturate at biological or business limits, hierarchical structures where groups behave differently, and the need to balance model flexibility with interpretability for stakeholders who need to understand why the model makes certain predictions.

This is an interesting article and I had not thought of using a GAM for calculating Customer Lifetime Value. I used a much simpler technique the one time I calculated CLV in earnest. H/T R-Bloggers.

Comments closed

Choosing a Time Series Forecast Model

Published 2025-10-07 by Kevin Feasel

Ivan Palomares Carrascosa builds a matrix:

Time series data have the added complexity of temporal dependencies, seasonality, and possible non-stationarity.

Arguably, the most frequent predictive problem to address with time series data is forecasting i.e. predicting future values of a variable like temperature or stock price based on historical observations up to the present. With so many different models for time series forecasting, practitioners might sometimes find it difficult to choose the most suitable approach.

This article is designed to help, through the use of a decision matrix accompanied by explanations on when and why to employee different models depending on data characteristics and problem type.

Ivan breaks it out into two dimensions, data complexity and univariate/multivariate, and explains which types of algorithms might work best in each.

Comments closed

Simulating the Monty Hall Problem in R

Published 2025-10-06 by Kevin Feasel

Jason Bryer takes us through a classic introductory problem to Bayesian statistics:

I find that when teaching statistics (and probability) it is often helpful to simulate data first in order to get an understanding of the problem. The Monty Hall problem recently came up in a class so I implemented a function to play the game.

The Monty Hall problem results from a game show, Let’s Make a Deal, hosted by Monty Hall. In this game, the player picks one of three doors. Behind one is a car, the other two are goats. After picking a door the player is shown the contents of one of the other two doors, which because the host knows the contents, is a goat. The question to the player: Do you switch your choice?

This is one of the biggest “aha!” moments in statistics, in the sense that it is not intuitively obvious and is easy to get wrong, but once you understand why it is true, it makes reasoning over time and knowledge changes easier. H/T R-Bloggers.

Comments closed

A Primer on Principal Component Analysis

Published 2025-10-02 by Kevin Feasel

Harris Amjad explains the basics of principal component analysis:

In this series of tips, we will delve into the unsupervised learning branch of Machine Learning. Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction, but its mathematical foundation involving eigenvalues and eigenvectors can be intimidating. This tip aims to demystify PCA, explaining its purpose, how it works, and its use in visualizing high-dimensional data.

Click through to learn how it works. This is a solid primer.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Category: Data Science