Day: February 19, 2026

Comparing Techniques for Text Featurization in Classification Problems

Ivan Palomaras Carrascosa tries a few things:

In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

  • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
  • How these representations compare on text classification performance and training speed.
  • How they behave differently for unsupervised document clustering.

Click through for results. Granted, the specific embedding model can alter the quality of results, but even so, I do enjoy the comparison of techniques and the reminder that neural networks aren’t the ultimate solution to everything.
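
To make the first two representations concrete, here is a minimal, dependency-free sketch of Bag-of-Words counts and TF-IDF weights. It is not the article's code, which works in scikit-learn. The function names (`tokenize`, `bag_of_words`, `tf_idf`), the whitespace tokenizer, and the three toy documents are assumptions for illustration, and the IDF formula is one common smoothed variant followed by L2 normalization.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def bag_of_words(docs):
    # Build a sorted vocabulary, then one row of raw term counts per document.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return vocab, rows


def tf_idf(rows):
    # Reweight raw counts by a smoothed inverse document frequency,
    # then scale each document vector to unit L2 length.
    n_docs = len(rows)
    n_terms = len(rows[0])
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in rows:
        scaled = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in scaled)) or 1.0
        weighted.append([v / norm for v in scaled])
    return weighted


docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy data
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

In the second toy document, "dog" appears in only one document while "the" appears in all three, so TF-IDF gives "dog" the larger weight even though both have a raw count of 1. That reweighting is the practical difference between the two count-based representations.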

A Review of the Portmanteau Theorem

Ben Smith digs into a theorem:

The Portmanteau Theorem provides a set of equivalences of weak convergence that still remains relevant for establishing asymptotic results in probability and statistics. While the theory around weak convergence is well developed, I was inspired to put together a writeup proving all the equivalences in a self-contained manner, by first presenting the relevant theorems applied (without proving them), along with a visual on the implication cycle created for the proof, some discussion about other presentations available in popular textbooks, and some historical notes.

Click through for the PDF.
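
For readers who want the statement at hand, one standard textbook formulation of the equivalences is sketched below. The exact list, ordering, and hypotheses in the linked writeup may differ; this version assumes Borel probability measures on a metric space.

```latex
Let $\mu_n, \mu$ be Borel probability measures on a metric space $S$.
The following are equivalent:
\begin{enumerate}
  \item $\int f \, d\mu_n \to \int f \, d\mu$ for every bounded continuous
        $f : S \to \mathbb{R}$ (weak convergence, $\mu_n \Rightarrow \mu$);
  \item $\int f \, d\mu_n \to \int f \, d\mu$ for every bounded Lipschitz $f$;
  \item $\limsup_{n} \mu_n(F) \le \mu(F)$ for every closed $F \subseteq S$;
  \item $\liminf_{n} \mu_n(G) \ge \mu(G)$ for every open $G \subseteq S$;
  \item $\lim_{n} \mu_n(A) = \mu(A)$ for every Borel set $A$ with
        $\mu(\partial A) = 0$.
\end{enumerate}
```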


Web Scraping with Python

Jason Yousef has a script:

Below is a production-friendly pattern that:

  • Uses a requests.Session with retries, backoff, and a real User-Agent
  • Sets sane timeouts and handles common HTTP errors
  • Respects robots.txt (and tells you if scraping is disallowed)
  • Parses only mailto: links by default to avoid scraping personal data you shouldn’t
  • Handles pagination with a “Next” link when present
  • Exports to CSV
  • Can be run from the command line with arguments

Click through for the code, some explanation of how it works, and a few tips.
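
Two of the bullets above, the robots.txt check and the mailto-only parsing, can be sketched offline with just the standard library. This is not Jason's script, which builds on a requests.Session with retries and pagination. The names `MailtoParser`, `extract_mailto`, and `is_allowed`, along with the sample robots.txt and HTML, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MailtoParser(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links and nothing else."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query part.
            address = href[len("mailto:"):].split("?", 1)[0].strip()
            if address and address not in self.emails:
                self.emails.append(address)


def extract_mailto(html):
    # Addresses that appear only in body text are ignored on purpose.
    parser = MailtoParser()
    parser.feed(html)
    return parser.emails


def is_allowed(robots_txt, user_agent, url):
    # Parse robots.txt content that has already been downloaded.
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"  # sample rules
page = (
    '<a href="mailto:info@example.com?subject=Hi">Mail us</a> '
    '<a href="/page/2">Next</a> <p>other@example.com</p>'
)
emails = extract_mailto(page)
```

Restricting extraction to explicit mailto: links means the address in the paragraph text is skipped, which matches the stated goal of not collecting personal data that was never published as a contact link.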


Running Totals over Arbitrary Date Ranges

Louis Davidson solves an interval problem:

Say you want to find the most recent 30-day period during which a person purchased some amount of products from your company. How you market to a customer might change if they have been active over a time period recently, or even in the past. But this also means that for each day going back in history, you need to sum that day's activity together with the previous 29 days, over and over. This is generally known as a rolling total. Doing this sort of calculation has been an interesting problem for many years.

When window functions came around, they became quite useful for such tasks, but they have one kind of complicated problem: gaps in source data patterns.

Funnily enough, there is a solution using window functions: range intervals. The ANSI SQL definition for RANGE (versus ROWS) for window functions does allow for the specification of a date range, like RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW. Very impressive.

Unfortunately, SQL Server doesn’t support these, though PostgreSQL does. It’s an area I’ve agitated about for a few years, and I do hope that someday the SQL Server product team will support this functionality. In the meantime, Louis has a solution that works well for the task.
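
To illustrate what a date-based RANGE frame computes when the source data has gaps, here is a small Python sketch. It is neither Louis's T-SQL solution nor a database feature. The function name `rolling_totals`, the `days_preceding` parameter, and the sample purchases are assumptions. With `days_preceding=29`, each window covers 30 calendar days including the current one.

```python
from collections import deque
from datetime import date, timedelta


def rolling_totals(purchases, days_preceding=29):
    # Sum amounts per calendar date first, so rows sharing a date are
    # counted together, the way peers are in a RANGE frame.
    daily = {}
    for day, amount in purchases:
        daily[day] = daily.get(day, 0) + amount

    window = deque()  # (date, amount) pairs currently inside the frame
    running = 0
    result = []
    for day in sorted(daily):
        window.append((day, daily[day]))
        running += daily[day]
        cutoff = day - timedelta(days=days_preceding)
        # Evict by date, not by row count, so gaps are handled correctly.
        while window[0][0] < cutoff:
            _, expired = window.popleft()
            running -= expired
        result.append((day, running))
    return result


purchases = [  # sample data with gaps between purchase dates
    (date(2026, 1, 1), 10),
    (date(2026, 1, 2), 5),
    (date(2026, 1, 30), 7),
    (date(2026, 1, 31), 1),
    (date(2026, 3, 15), 4),
]
totals = rolling_totals(purchases)
```

A row-count frame such as 29 rows preceding would add all five sample purchases together for the March date, because it counts rows rather than days. Evicting by date is what keeps the January activity out of that total.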
