Data Science – Curated SQL

Handling Imbalanced Data in Python

Published 2025-06-13 by Kevin Feasel

Ivan Palomares Carrascosa gives three ways to deal with imbalanced data:

Here’s the catch: having imbalanced data usually makes analysis processes more difficult, especially for machine learning models that can easily get biased toward the majority class as a result of dealing with data with a remarkably unequal class distribution, thereby ending up becoming an almost “dummy classifier” that assigns the same class to virtually everything — in the most extreme case.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Click through for those ways, including sample code.

Advanced Imputation Techniques via scikit-learn

Published 2025-06-09 by Kevin Feasel

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, — like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute) — these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.

Comments closed

A Primer on Loss Functions

Published 2025-06-06 by Kevin Feasel

Kanwal Mehreen compares loss functions:

I must say, with the ongoing hype around machine learning, a lot of people jump straight to the application side without really understanding how things work behind the scenes. What’s our objective with any machine learning model, anyway? You might say, “To make accurate predictions.” Fair enough.

But how do you actually tell your model, “You’re close” or “You’re way off”? How does it know it made a mistake — and by how much?

That’s where loss functions come in.

Read on to learn what loss functions are, how they work, and when you might want to choose each.

Comments closed

Extending caret for Spatial Machine Learning

Published 2025-05-15 by Kevin Feasel

Jan Linnenbrink looks at spatial data:

This document shows the application of caret for spatial modelling at the example of predicting air temperature in Spain. Hereby, we use measurements of air temperature available only at specific locations in Spain to create a spatially continuous map of air temperature. Therefore, machine-learning models are trained to learn the relationship between spatially continuous predictors and air temperature.

When using machine-learning methods with spatial data, we need to take care of, e.g., spatial autocorrelation, as well as extrapolation when predicting to regions that are far away from the training data. To deal with these issues, several methods have been developed. In this document, we will show how to combine the machine-learning workflow of caret with packages designed to deal with machine-learning with spatial data. Hereby, we use blockCV::cv_spatial() and CAST::knndm() for spatial cross-validation, and CAST::aoa() to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.

Click through to see how it all works. H/T R-Bloggers.

Comments closed

The Dual Perils of Overfitting and Data Leakage

Published 2025-05-07 by Kevin Feasel

John Mount shares notes on a theme:

One of the bigger risks of iterative statistical or machine learning fitting procedures is over-fit or the dreaded data leak.

Over-fit is when: a model performs better on training data than on future data. Some degree of over-fit is expected. A data leak is when: the model learns things about the evaluation set that it would not know about the future data the model will be applied on. This can drive models that look great on training and (supposedly) held-out data, but don’t work in practice.

Click through for the rest of the story, and be sure to check out the comments for a notebook digging further into one of the topics.

Comments closed

Breaking down the Limitations of R^2

Published 2025-05-02 by Kevin Feasel

M. Fatih Tüzen explains an important regression concept:

When building a statistical model, one of the first numbers analysts and data scientists often cite is the R², or coefficient of determination. It’s widely reported in research, academic theses, and industry reports — and yet, frequently misunderstood or misused.

Does a high R² mean your model is good? Is it enough to evaluate model performance? What about its adjusted or predictive counterparts?

Read on to learn the answers to each question. H/T R-Bloggers.

Comments closed

Model Diagnostics for Statistics vs Machine Learning

Published 2025-05-02 by Kevin Feasel

Christian Lorentzen talks diagnostics:

In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.

As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications 2nd ed. (2021). This dataset contains monthy rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999.

Read on to learn more about this dataset and how the mindset differs if you’re thinking about inference versus prediction.

Comments closed

The Monty Hall Problem

Published 2025-04-16 by Kevin Feasel

I have a new video:

In this video, I explain the classic Monty Hall problem, based on the concept of the show Let’s Make a Deal. I explain the paradox behind the problem and demonstrate that it’s better to switch doors.

I’m not joking at all when I say it took me years of listening to explanations before it actually clicked. Some of it is my innate stubbornness, but I think this is a great example of a true paradox, where the intuitive answer is wrong and first-level reasoning also leads you astray.

Comments closed

Data Splitting and Cross-Validation in R

Published 2025-04-14 by Kevin Feasel

Nick Han has a pair of articles. First up is on data splitting and pre-processing:

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.

H/T R-Bloggers for that one.

The second involves using cross-validation via the caret package in R:

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.

H/T R-Bloggers for that as well.

Comments closed

Matrix Calculator and Operations Toolkit

Published 2025-04-10 by Kevin Feasel

Sebastiao Pereira hearkens back to linear algebra courses:

Is there a way to perform matrix operations or calculations directly within the database, eliminating the need for an external matrix calculator in SQL Server?

Click through for a large number of functions. Sebastiao went all-out in this one.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data Science