Press "Enter" to skip to content

Category: Data Science

Pulling Random Values from a Gaussian Distribution in T-SQL

Sebastiao Pereira has another way of populating a random variable:

Generating random numbers from a normal distribution is essential for accurate and realistic modeling. Such numbers are used in simulation, inference, cryptography, and algorithm design across scientific, engineering, statistical, and AI domains. Is it possible to create random Gaussian numbers in SQL Server using the Ziggurat algorithm without external tools?

I was not familiar with this technique, so it’s neat to see it in action.
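
Sebastiao's implementation is pure T-SQL; as a rough sketch of what the algorithm itself does (using the classic 128-layer constants from Marsaglia and Tsang's 2000 paper, not anything taken from the post's code), here is a compact Python version. The precomputed layer table makes almost every draw a single multiply and compare, with rejection tests needed only at the layer edges and in the tail.

```python
import math
import random

N = 128
R = 3.442619855899           # rightmost layer boundary (Marsaglia & Tsang, 2000)
V = 9.91256303526217e-3      # common area of every layer, including the base strip

def f(x):
    """Unnormalized standard normal density exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)

# Layer boundaries x[0] = 0 (the peak) up to x[N-1] = R, chosen so that each
# rectangle x[i] * (f(x[i-1]) - f(x[i])) has the same area V.
x = [0.0] * N
x[N - 1] = R
for i in range(N - 2, 0, -1):
    x[i] = math.sqrt(-2.0 * math.log(V / x[i + 1] + f(x[i + 1])))
fx = [f(b) for b in x]       # cached density at each boundary; fx[0] == 1.0
q = V / f(R)                 # width of the pseudo-rectangle holding the tail

def ziggurat_normal():
    while True:
        i = random.randrange(N)            # pick a layer uniformly
        u = random.uniform(-1.0, 1.0)
        if i == 0:                         # base strip: flat part plus the tail
            z = u * q
            if abs(z) < R:
                return z
            while True:                    # Marsaglia's tail method beyond R
                xt = -math.log(1.0 - random.random()) / R
                yt = -math.log(1.0 - random.random())
                if 2.0 * yt > xt * xt:
                    return math.copysign(R + xt, u)
        else:
            z = u * x[i]
            if abs(z) < x[i - 1]:          # strictly inside the rectangle
                return z
            # Wedge: uniform height within the layer, accept if under the curve.
            y = fx[i] + random.random() * (fx[i - 1] - fx[i])
            if y < f(z):
                return z

samples = [ziggurat_normal() for _ in range(100_000)]
```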

Calculating Exponential Moving Average in T-SQL

Rick Dobson watches the flow:

Exponential moving averages (EMAs) are a powerful means of detecting changes in time series data. However, if you are new to this task, you may be wondering how to choose from conflicting advice about how to calculate EMAs. This tip reviews several of the most popular methods for calculating moving averages. Additionally, this tip presents T-SQL code samples with common table expressions and stored procedures for generating EMAs from an underlying time series dataset.

“EMAs don’t just track trends; they reveal momentum in motion.” That’s why they’re favored when recent values matter most, and why this tip focuses on helping you calculate them with precision.

Read on for the formula and a couple of lengthy scripts to generate them.
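
The underlying recurrence is short: with smoothing factor alpha = 2 / (n + 1), each new EMA is alpha times the latest observation plus (1 - alpha) times the previous EMA. Rick's samples are T-SQL; as a minimal Python sketch (seeding with the first observation, which is one of the competing conventions behind the conflicting advice the tip mentions):

```python
def ema(values, n):
    """Exponential moving average with smoothing factor alpha = 2 / (n + 1).

    Seeding convention: the first EMA equals the first observation. Another
    common choice is to seed with the simple moving average of the first n
    points, which is one source of the conflicting advice out there.
    """
    alpha = 2.0 / (n + 1)
    result = [values[0]]
    for value in values[1:]:
        result.append(alpha * value + (1.0 - alpha) * result[-1])
    return result

closes = [44.34, 44.09, 44.15, 43.61, 44.33, 44.83, 45.10, 45.42]
print(ema(closes, n=5))   # later values weight recent prices more heavily
```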

Random Number Generation in T-SQL via Marsaglia Polar Method

Sebastiao Pereira implements a method for generating random numbers in T-SQL:

Generating random numbers from a normal distribution is essential for accurate and realistic modeling, simulation, inference, and algorithm design across scientific, engineering, statistical, and AI domains. How can we build a random number generator using the Marsaglia polar method in SQL Server without the use of external tools?

It’s an interesting technique: it draws points uniformly from the unit disk and transforms them into pairs of independent normal values.
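
The post builds this in T-SQL; for reference, the method itself fits in a few lines of Python:

```python
import math
import random

def marsaglia_polar():
    """Return two independent standard normal values.

    Draw (u, v) uniformly from the square [-1, 1]^2 and reject anything
    outside the unit disk. For accepted points, s = u^2 + v^2 is uniform on
    (0, 1), and the transform below yields two independent N(0, 1) values
    without any trigonometric calls.
    """
    while True:
        u = random.uniform(-1.0, 1.0)
        v = random.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor

z1, z2 = marsaglia_polar()
```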

LightSHAP

Michael Mayer announces a new Python package:

LightSHAP is here – a new, lightweight SHAP implementation for tabular data. While heavily inspired by the famous shap package, it has no dependency on it. LightSHAP simplifies working with dataframes (pandas, polars) and categorical data.

Read on to see how it works. Version 0.1.12 is the current version as of this post, and it’s available via PyPI.

Generalized Additive Models for Customer Lifetime Value Estimation

Nicholas Clark builds a GAM:

I typically work in quantitative ecology and molecular epidemiology, where we use statistical models to predict species distributions or disease transmission patterns. Recently though, I had an interesting conversation with a data science PhD student who mentioned they were applying GAMs to predict Customer Lifetime Value at a SaaS startup. This caught my attention because CLV prediction, as it turns out, faces remarkably similar statistical challenges to ecological forecasting: nonlinear relationships that saturate at biological or business limits, hierarchical structures where groups behave differently, and the need to balance model flexibility with interpretability for stakeholders who need to understand why the model makes certain predictions.

This is an interesting article and I had not thought of using a GAM for calculating Customer Lifetime Value. I used a much simpler technique the one time I calculated CLV in earnest. H/T R-Bloggers.
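
Nicholas’s models are built in R. Purely as a sketch of the model shape, with invented features and synthetic data rather than anything from the article, a rough Python analogue using the pygam package might look like this:

```python
# Sketch only: synthetic data and made-up features, using pygam as a rough
# Python stand-in for the R GAMs in the article.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(42)
n = 1000
tenure = rng.uniform(0, 36, n)       # months as a customer (invented feature)
usage = rng.gamma(2.0, 10.0, n)      # activity measure (invented feature)
# CLV that rises with tenure but saturates, plus a nonlinear usage effect.
clv = (50 * (1 - np.exp(-tenure / 12))
       + 5 * np.log1p(usage)
       + rng.normal(0, 2, n))

X = np.column_stack([tenure, usage])
gam = LinearGAM(s(0) + s(1)).fit(X, clv)   # one smooth term per feature
gam.summary()                              # the smooths capture the saturation
```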

Choosing a Time Series Forecast Model

Ivan Palomares Carrascosa builds a matrix:

Time series data have the added complexity of temporal dependencies, seasonality, and possible non-stationarity.

Arguably, the most frequent predictive problem to address with time series data is forecasting, i.e., predicting future values of a variable like temperature or stock price based on historical observations up to the present. With so many different models for time series forecasting, practitioners might sometimes find it difficult to choose the most suitable approach.

This article is designed to help, through the use of a decision matrix accompanied by explanations of when and why to employ different models depending on data characteristics and problem type.

Ivan breaks it out into two dimensions, data complexity and univariate/multivariate, and explains which types of algorithms might work best in each.

Simulating the Monty Hall Problem in R

Jason Bryer takes us through a classic introductory problem in Bayesian statistics:

I find that when teaching statistics (and probability) it is often helpful to simulate data first in order to get an understanding of the problem. The Monty Hall problem recently came up in a class so I implemented a function to play the game.

The Monty Hall problem comes from the game show Let’s Make a Deal, hosted by Monty Hall. In this game, the player picks one of three doors. Behind one is a car; behind the other two are goats. After the player picks a door, the host, who knows what is behind each door, opens one of the other two doors to reveal a goat. The question to the player: do you switch your choice?

This is one of the biggest “aha!” moments in statistics, in the sense that it is not intuitively obvious and is easy to get wrong, but once you understand why it is true, it becomes much easier to reason about how probabilities update as your knowledge changes. H/T R-Bloggers.
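
Jason’s simulation is in R; an equivalent Python sketch makes the two-thirds answer easy to verify:

```python
import random

def play(switch, doors=3):
    """Play one round of Monty Hall; return True if the player wins the car."""
    car = random.randrange(doors)
    pick = random.randrange(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = random.choice([d for d in range(doors) if d not in (pick, car)])
    if switch:
        pick = next(d for d in range(doors) if d not in (pick, opened))
    return pick == car

trials = 100_000
print("stay:  ", sum(play(False) for _ in range(trials)) / trials)  # ~1/3
print("switch:", sum(play(True) for _ in range(trials)) / trials)   # ~2/3
```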

A Primer on Principal Component Analysis

Harris Amjad explains the basics of principal component analysis:

In this series of tips, we will delve into the unsupervised learning branch of Machine Learning. Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction, but its mathematical foundation involving eigenvalues and eigenvectors can be intimidating. This tip aims to demystify PCA, explaining its purpose, how it works, and its use in visualizing high-dimensional data.

Click through to learn how it works. This is a solid primer.
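
As a quick taste of the mechanics the tip walks through (center the data, eigendecompose the covariance matrix, project onto the leading eigenvectors), here is a from-scratch numpy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

# 1. Center the data (scaling first is common when units differ).
Xc = X - X.mean(axis=0)

# 2. Eigendecompose the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top k principal components.
k = 2
scores = Xc @ eigvecs[:, :k]

print("variance explained by PC1, PC2:", (eigvals / eigvals.sum())[:2])
```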

Choosing between Data Scalers in a Data Science Project

Bala Pirya C performs a comparison:

In this article, you will learn how MinMaxScaler, StandardScaler, and RobustScaler transform skewed, outlier-heavy data, and how to pick the right one for your modeling pipeline.

Topics we will cover include:

  • How each scaler works and where it breaks on skewed or outlier-rich data
  • A realistic synthetic dataset to stress-test the scalers
  • A practical, code-ready heuristic for choosing a scaler

Read on to learn more about each of these three scaler types, the use cases that best fit each of them, and even a flow chart at the end.
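
As a hedged sketch of the comparison (on a made-up skewed, outlier-heavy column rather than the article’s dataset), the three scikit-learn scalers treat the same data very differently:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(1)
bulk = rng.lognormal(mean=0.0, sigma=0.5, size=995)      # skewed bulk
x = np.concatenate([bulk, [50, 75, 100, 150, 200]]).reshape(-1, 1)  # outliers

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    z = scaler.fit_transform(x)
    # MinMaxScaler squeezes the bulk toward 0 because the max is an outlier;
    # StandardScaler's mean and std are dragged by the outliers; RobustScaler
    # (median and IQR) keeps the central mass on a sensible scale.
    print(type(scaler).__name__, np.percentile(z, [25, 50, 75]).round(3))
```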

Cross-Validation and Time Series Data

Vlad Johnson takes us through a technique to test time series results:

Time series modeling, compared to traditional nontemporal modeling, presents unique challenges in ensuring that models generalize well to future, unseen data. One key methodology to address these challenges is cross-validation.

Time series data inherently contains temporal dependencies — observations are ordered in time, and future values may depend on past trends. This structure makes it challenging to estimate how well a model will perform on new, unseen data.

Click through for an explanation of cross-validation, why this becomes challenging when you have time series data (or other serially correlated data), and tips to resolve this challenge.
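
One standard tool for this, though not necessarily the one Vlad uses, is scikit-learn’s TimeSeriesSplit, which implements the expanding-window scheme: each fold trains on a prefix of the series and validates on the observations that immediately follow, so nothing from the future leaks into training. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)     # stand-in for 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Each fold trains only on the past and validates strictly on the future.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```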
