Data Science – Curated SQL

Making k-Means Clustering Better

Published 2025-07-18 by Kevin Feasel

The k-means algorithm is a cornerstone of unsupervised machine learning, known for its simplicity and trusted for its efficiency in partitioning data into a predetermined number of clusters. Its straightforward approach — assigning data points to the nearest centroid and then updating the centroid based on the mean of the assigned points — makes it one of the first algorithms most data scientists learn. It is a workhorse, capable of providing quick and valuable insights into the underlying structure of a dataset.

This simplicity comes with a set of limitations, however. Standard k-means often struggles when faced with the complexities of real-world data. Its performance can be sensitive to the initial placement of centroids, it requires the number of clusters to be specified in advance, and it fundamentally assumes that clusters are spherical and evenly sized. These assumptions rarely hold true in the wild, leading to suboptimal or even misleading results.

Read on for a few ways to relax some of the constraints in k-means clustering.
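
As a rough illustration of the assign-then-update loop described above, here is a minimal sketch in plain Python. It is not the linked article's code: the function name, sample points, and starting centroids are made up for illustration. Starting centroids are passed in explicitly because, as noted above, results can be sensitive to that choice.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means on equal-length numeric tuples (illustrative sketch)."""
    centroids = list(centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
# Assumed starting centroids: one point taken from each group.
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The number of clusters is fixed by how many starting centroids are supplied, which is the "specified in advance" constraint mentioned above.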

Choosing a Good Split for a Decision Tree

Published 2025-07-17 by Kevin Feasel

Ivan Palomares Carrascosa continues a series on decision trees:

But what are the underlying mechanisms that make decision trees so well-suited for various predictive tasks? And what criteria are internally used to construct them? Specifically, how are nodes recursively split as the tree-shaped structure is formed? This article takes a closer look at the inner workings of decision trees, focusing on how branches are created through deliberate, data-driven splitting (spoiler: it certainly doesn’t happen at random).

One of the main principles of CART is finding efficient splits for trees, and this article digs into some of those details.
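
To make the idea of a data-driven split concrete, here is a minimal sketch that scores candidate thresholds on a single numeric feature using Gini impurity, one common splitting criterion. The linked article may use a different criterion, and the function names and sample data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each side's impurity by its share of the rows.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: does a customer buy, given their age?
ages = [22, 25, 30, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, bought)
print(threshold, impurity)
```

A full tree builder would repeat this search over every feature and then recurse into each side of the chosen split.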

Decision Trees and Non-Tabular Data

Published 2025-07-11 by Kevin Feasel

Ivan Palomares Carrascosa explains that decision trees can work with more than standard structured data:

Versatile, interpretable, and effective for a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades, applied to both classification and regression tasks. They remain widely used today, whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes the boundaries of their versatility even further: they can accommodate data in diverse formats, beyond just fully structured, tabular data. This article examines this facet of decision trees from a balanced theoretical and practical approach.

Click through for an example.

The Through-the-Door Problem in Credit Risk Modeling

Published 2025-07-10 by Kevin Feasel

Richard Vale takes us through a data challenge:

In credit risk modelling, you want to calculate the probability that a loan will default. Since different financial institutions gather different data and offer different products, there is no one-size-fits-all approach to doing this. Therefore, credit risk models are usually built using the institution’s own data. For example, if I’m building a credit risk model for XYZ Bank, I look at loans which XYZ Bank has previously granted, and try to estimate the probability that a future loan will default based on principal, tenor, the borrower’s credit rating, and so on.

For those who haven’t heard of the through-the-door problem before, this is a good moment to pause and think about what is wrong with this. Why does this process contain a huge pitfall?

Click through for the answer, as well as an example of the problem and one way to get around this. H/T R-Bloggers.

Spatial Cross-Validation in R

Published 2025-07-09 by Kevin Feasel

Jakub Nowosad wraps up a series:

This document provides an overview of two R packages, sperrorest and blockCV, that can be used for spatial cross validation, but are outside of standard machine learning frameworks like caret, tidymodels, or mlr3.

All of the examples below use the same dataset, which includes the temperature measurements in Spain, a set of covariates, and the spatial coordinates of the temperature measurements.

Click through for a pair of cross-validation packages, as well as a link to the rest of the series. H/T R-Bloggers.
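
For a feel of what spatial cross-validation changes, here is a minimal sketch of block-based fold assignment in plain Python. It does not reproduce the API or algorithms of sperrorest or blockCV (both are R packages); the function name, block size, and coordinates are assumptions made for illustration.

```python
import math


def spatial_block_folds(coords, block_size, n_folds):
    """Assign each (x, y) point to a fold based on the grid block it falls in."""
    def block_of(x, y):
        return (math.floor(x / block_size), math.floor(y / block_size))

    blocks = sorted({block_of(x, y) for x, y in coords})
    # Deal whole blocks out to folds round-robin, so that nearby points
    # (same block) never end up split between training and test folds.
    block_to_fold = {block: i % n_folds for i, block in enumerate(blocks)}
    return [block_to_fold[block_of(x, y)] for x, y in coords]


# Hypothetical coordinates on a 2x2 grid of unit-sized blocks.
coords = [(0.1, 0.2), (0.4, 0.3), (1.5, 0.2), (1.7, 0.9), (0.2, 1.6), (1.1, 1.9)]
folds = spatial_block_folds(coords, block_size=1.0, n_folds=2)
print(folds)
```

The point of grouping by block rather than shuffling rows is that spatially close observations tend to be similar, so a random split can make held-out error look better than it would be at genuinely new locations.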

Handling Imbalanced Data in Python

Published 2025-06-13 by Kevin Feasel

Ivan Palomares Carrascosa gives three ways to deal with imbalanced data:

Here’s the catch: imbalanced data usually makes analysis more difficult, especially for machine learning models, which can easily become biased toward the majority class when the class distribution is remarkably unequal. In the most extreme case, the model ends up as an almost “dummy classifier” that assigns the same class to virtually everything.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Click through for those ways, including sample code.
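
As one illustration of the general idea, here is a minimal sketch of random oversampling using only the standard library. It is not necessarily one of the three strategies in the linked article, which works with Pandas and Scikit-learn; the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement to fill the gap for this class.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical data: 8 majority-class rows and 2 minority-class rows.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling of this kind should be applied to the training split only; oversampling before splitting lets copies of the same row land on both sides of the split.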

Advanced Imputation Techniques via scikit-learn

Published 2025-06-09 by Kevin Feasel

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, like removing rows or columns entirely or imputing missing values with a default value (typically the mean or median of the attribute), these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, each with an example of how to use it and the circumstances under which to avoid it.

A Primer on Loss Functions

Published 2025-06-06 by Kevin Feasel

Kanwal Mehreen compares loss functions:

I must say, with the ongoing hype around machine learning, a lot of people jump straight to the application side without really understanding how things work behind the scenes. What’s our objective with any machine learning model, anyway? You might say, “To make accurate predictions.” Fair enough.

But how do you actually tell your model, “You’re close” or “You’re way off”? How does it know it made a mistake — and by how much?

That’s where loss functions come in.

Read on to learn what loss functions are, how they work, and when you might want to choose each.
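
To show how a loss function turns "close" or "way off" into a number, here is a minimal sketch of three widely used losses in plain Python. The selection and the sample values are my own for illustration and may not match the set covered in the linked article.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: penalizes large misses much more than small ones."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error costs the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical regression targets and predictions.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Hypothetical binary labels and predicted probabilities.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```

On the regression example, the single miss of 2.0 dominates the squared error far more than it dominates the absolute error, which is the usual reason to prefer one over the other when outliers are present.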

Extending caret for Spatial Machine Learning

Published 2025-05-15 by Kevin Feasel

Jan Linnenbrink looks at spatial data:

This document shows how to apply caret to spatial modelling, using the prediction of air temperature in Spain as an example. We use measurements of air temperature, available only at specific locations in Spain, to create a spatially continuous map of air temperature. To do so, machine-learning models are trained to learn the relationship between spatially continuous predictors and air temperature.

When using machine-learning methods with spatial data, we need to take care of, e.g., spatial autocorrelation, as well as extrapolation when predicting to regions that are far away from the training data. To deal with these issues, several methods have been developed. In this document, we will show how to combine the machine-learning workflow of caret with packages designed to deal with machine-learning with spatial data. Here, we use blockCV::cv_spatial() and CAST::knndm() for spatial cross-validation, and CAST::aoa() to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.

Click through to see how it all works. H/T R-Bloggers.

The Dual Perils of Overfitting and Data Leakage

Published 2025-05-07 by Kevin Feasel

John Mount shares notes on a theme:

One of the bigger risks of iterative statistical or machine learning fitting procedures is over-fit or the dreaded data leak.

Over-fit is when: a model performs better on training data than on future data. Some degree of over-fit is expected. A data leak is when: the model learns things about the evaluation set that it would not know about the future data the model will be applied on. This can drive models that look great on training and (supposedly) held-out data, but don’t work in practice.

Click through for the rest of the story, and be sure to check out the comments for a notebook digging further into one of the topics.
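
As a small illustration of a data leak in the sense defined above, here is a sketch of one common form: fitting a preprocessing step on training and held-out data together. This is my own example and not necessarily the scenario the linked post examines; the function names and numbers are hypothetical.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and standard deviation used for standardization."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std


def transform(values, scaler):
    """Standardize values with a previously fitted (mean, std) pair."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Hypothetical data: the held-out values look quite different from training.
train = [1.0, 2.0, 3.0, 4.0]
held_out = [10.0, 12.0]

# Leaky: the scaler has seen the held-out values before evaluation.
leaky_scaler = fit_scaler(train + held_out)

# Clean: the scaler is fitted on training data only, then reused as-is.
clean_scaler = fit_scaler(train)

print(transform(held_out, leaky_scaler))
print(transform(held_out, clean_scaler))
```

With the leaky scaler, the held-out values look far less unusual than they do under the clean one, because information about them has already shaped the preprocessing.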

