
Category: Data Science

Python Libraries for Advanced Time Series Forecasting

Ivan Palomares Carrascosa has a list:

Fortunately, Python’s ecosystem has evolved to meet this demand. The landscape has shifted from purely statistical packages to a rich array of libraries that integrate deep learning, machine learning pipelines, and classical econometrics. But with so many options, choosing the right framework can be overwhelming.

This article cuts through the noise to focus on 5 powerhouse Python libraries designed specifically for advanced time series forecasting. We move beyond the basics to explore tools capable of handling high-dimensional data, complex seasonality, and exogenous variables. For each library, we provide a high-level overview of its standout features and a concise “Hello World” code snippet so you can get familiar with it immediately.

Click through for an explanation of each of the five libraries.


How Data Leakage Can Hurt Model Performance

Ivan Palomares Carrascosa leaks some data:

In this article, you will learn what data leakage is, how it silently inflates model performance, and practical patterns for preventing it across common workflows.

Topics we will cover include:

  • Identifying target leakage and removing target-derived features.
  • Preventing train–test contamination by ordering preprocessing correctly.
  • Avoiding temporal leakage in time series with proper feature design and splits.
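The train–test contamination point in the second bullet can be illustrated with a deliberately tiny standardization sketch (plain Python with hypothetical data and function names; not code from the article):

```python
from statistics import mean, stdev

def fit_standardizer(train):
    """Learn scaling parameters from the given split only."""
    mu, sigma = mean(train), stdev(train)
    return lambda xs: [(x - mu) / sigma for x in xs]

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # the outlier lands in the test split
train, test = data[:4], data[4:]

# Correct ordering: statistics come from the training data alone.
scale = fit_standardizer(train)
train_scaled, test_scaled = scale(train), scale(test)

# Leaky ordering: fitting on the full dataset lets test-set information
# (the outlier's pull on the mean and spread) bleed into training features.
leaky_scale = fit_standardizer(data)
```

The same ordering rule applies to any fitted preprocessing step (imputation, encoding, feature selection): fit on the training split, then apply to the test split.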

Read on to learn more.


Using Haskell for Data Science

Jonathan Carroll has my attention:

I’ve been learning Haskell for a few years now and I am really liking a lot of the features, not least the strong typing and functional approach. I thought it was lacking some of the things I missed from R until I found the dataHaskell (www.datahaskell.org) project.

There have been several attempts recently to enhance R with some strong types, e.g. vapour (vapour.run), typr (github.com), using {rlang}’s checks (josiahparry.com), and even discussions about implementations at the core level, e.g. in September 2025 (stat.ethz.ch), continued in November 2025 (stat.ethz.ch). While these try to bend R towards types, perhaps an all-in solution makes more sense.

In this post I’ll demonstrate some of the features and explain why I think it makes for a good (great?) data science language.

I’ve been a big fan of F# for data science work as well for similar reasons, so it was interesting to read this article on Haskell. H/T R-Bloggers.


When Decision Trees Fail

Ivan Palomares Carrascosa builds an explanation:

In this article, you will learn why decision trees sometimes fail in practice and how to correct the most common issues with simple, effective techniques.

Topics we will cover include:

  • How to spot and reduce overfitting in decision trees.
  • How to recognize and fix underfitting by tuning model capacity.
  • How noisy or redundant features mislead trees and how feature selection helps.
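The first two bullets can be illustrated with a small scikit-learn sketch on synthetic data (illustrative parameter values and data, not from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data: only the first feature carries signal
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the noise in the training set;
# capping depth trades training accuracy for generalization.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep:    train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print("shallow: train %.2f  test %.2f" % (shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))
```

A large gap between a tree's training and test scores is the classic overfitting symptom; a tree that scores poorly on both is underfit and needs more capacity, not less.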

Read on for some of the perils of CART and some ways to resolve them.


Four Measures for Vector Search Quality

Joe Sack explains four important measures:

You type “3-bedroom townhouse near a good school” into a home search site. It shows 10 homes. Some perfect, some okay, some wrong. How do you know if it’s working?

Four numbers help with this: Precision (what proportion is relevant), Recall (what you missed), MRR (how far to the first relevant result), nDCG (best stuff first).
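Those four numbers can be sketched directly in Python (a minimal, illustrative implementation using binary 0/1 relevance flags in rank order; not code from the article):

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    return sum(rels[:k]) / total_relevant

def reciprocal_rank(rels):
    """1 / rank of the first relevant result; 0 if none. Average over queries for MRR."""
    return next((1 / i for i, r in enumerate(rels, 1) if r), 0.0)

def ndcg(rels):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = lambda rs: sum(r / math.log2(i + 1) for i, r in enumerate(rs, 1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

# A 10-result page where results 1, 3, and 4 are the relevant homes
rels = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
```

With graded (non-binary) relevance, nDCG is the one of the four that still applies unchanged, which is part of why it's popular for ranking evaluation.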

Read on to learn what each one means and how it applies to vector search.


Pulling Random Values from a Gaussian Distribution in T-SQL

Sebastiao Pereira has another way of generating normally distributed random values:

Generating random numbers from a normal distribution is essential for accurate and realistic modeling. It is used for simulation, inference, cryptography, and algorithm design across scientific, engineering, statistical, and AI domains. Is it possible to create random Gaussian numbers in SQL Server using the Ziggurat algorithm without external tools?

I was not familiar with this technique, so it’s neat to see it in action.


Calculating Exponential Moving Average in T-SQL

Rick Dobson watches the flow:

Exponential moving averages (EMAs) are a powerful means of detecting changes in time series data. However, if you are new to this task, you may be wondering how to choose from conflicting advice about how to calculate EMAs. This tip reviews several of the most popular methods for calculating moving averages. Additionally, this tip presents T-SQL code samples with common table expressions and stored procedures for generating EMAs from an underlying time series dataset.

“EMAs don’t just track trends—they reveal momentum in motion.” That’s why they’re favored when recent values matter most—and why this tip focuses on helping you calculate them with precision.

Read on for the formula and a couple of lengthy scripts to generate it.
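For reference, the recursion underlying most of the variants, ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}, can be sketched in Python (seeding with the first observation and alpha = 2 / (n + 1) are common conventions, not necessarily the ones the tip uses):

```python
def ema(values, n):
    """Exponential moving average with smoothing factor alpha = 2 / (n + 1),
    seeded with the first observation."""
    alpha = 2 / (n + 1)
    out = [values[0]]
    for x in values[1:]:
        # Recent values get weight alpha; history decays geometrically
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

# ema([10, 11, 12, 13], n=3) -> [10, 10.5, 11.25, 12.125]
```

The choice of seed (first value vs. a simple average of the first n values) is exactly the kind of conflicting advice the tip sorts through.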


Random Number Generation in T-SQL via Marsaglia Polar Method

Sebastiao Pereira implements a method for generating random numbers in T-SQL:

Generating random numbers from a normal distribution is essential for accurate and realistic modeling, simulation, inference, and algorithm design across scientific, engineering, statistical, and AI domains. How can we build a random number generator using the Marsaglia polar method in SQL Server without the use of external tools?

It’s an interesting rejection-based technique: draw uniform points from the unit disk, then transform each accepted point into a pair of independent normally distributed values.
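The method itself is compact; here is a minimal Python sketch for reference (the article builds it in T-SQL, so this is an illustration of the algorithm, not its code):

```python
import math
import random

def marsaglia_polar(rng=random):
    """Return two independent draws from the standard normal distribution.

    Rejection step: sample uniform points in the square [-1, 1]^2 and keep
    only those strictly inside the unit disk; the transform below then
    yields a pair of independent N(0, 1) values.
    """
    while True:
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor
```

About pi/4 (roughly 79%) of candidate points are accepted, and unlike Box–Muller the method needs no trigonometric functions, which is part of its appeal in environments like T-SQL.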


LightSHAP

Michael Mayer announces a new Python package:

LightSHAP is here – a new, lightweight SHAP implementation for tabular data. While heavily inspired by the well-known shap package, it has no dependency on it. LightSHAP simplifies working with dataframes (pandas, polars) and categorical data.

Read on to see how it works. Version 0.1.12 is the current version as of this post, and it’s available via PyPI.


Generalized Additive Models for Customer Lifetime Value Estimation

Nicholas Clark builds a GAM:

I typically work in quantitative ecology and molecular epidemiology, where we use statistical models to predict species distributions or disease transmission patterns. Recently though, I had an interesting conversation with a data science PhD student who mentioned they were applying GAMs to predict Customer Lifetime Value at a SaaS startup. This caught my attention because CLV prediction, as it turns out, faces remarkably similar statistical challenges to ecological forecasting: nonlinear relationships that saturate at biological or business limits, hierarchical structures where groups behave differently, and the need to balance model flexibility with interpretability for stakeholders who need to understand why the model makes certain predictions.

This is an interesting article and I had not thought of using a GAM for calculating Customer Lifetime Value. I used a much simpler technique the one time I calculated CLV in earnest. H/T R-Bloggers.
