Data Science – Page 2

A Primer on Loss Functions

Published 2025-06-06 by Kevin Feasel

I must say, with the ongoing hype around machine learning, a lot of people jump straight to the application side without really understanding how things work behind the scenes. What’s our objective with any machine learning model, anyway? You might say, “To make accurate predictions.” Fair enough.

But how do you actually tell your model, “You’re close” or “You’re way off”? How does it know it made a mistake — and by how much?

That’s where loss functions come in.

Read on to learn what loss functions are, how they work, and when you might want to choose each.

Comments closed

Extending caret for Spatial Machine Learning

Published 2025-05-15 by Kevin Feasel

Jan Linnenbrink looks at spatial data:

This document shows the application of caret for spatial modelling at the example of predicting air temperature in Spain. Hereby, we use measurements of air temperature available only at specific locations in Spain to create a spatially continuous map of air temperature. Therefore, machine-learning models are trained to learn the relationship between spatially continuous predictors and air temperature.

When using machine-learning methods with spatial data, we need to take care of, e.g., spatial autocorrelation, as well as extrapolation when predicting to regions that are far away from the training data. To deal with these issues, several methods have been developed. In this document, we will show how to combine the machine-learning workflow of caret with packages designed to deal with machine-learning with spatial data. Hereby, we use blockCV::cv_spatial() and CAST::knndm() for spatial cross-validation, and CAST::aoa() to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.

Click through to see how it all works. H/T R-Bloggers.

Comments closed

The Dual Perils of Overfitting and Data Leakage

Published 2025-05-07 by Kevin Feasel

John Mount shares notes on a theme:

One of the bigger risks of iterative statistical or machine learning fitting procedures is over-fit or the dreaded data leak.

Over-fit is when: a model performs better on training data than on future data. Some degree of over-fit is expected. A data leak is when: the model learns things about the evaluation set that it would not know about the future data the model will be applied on. This can drive models that look great on training and (supposedly) held-out data, but don’t work in practice.

Click through for the rest of the story, and be sure to check out the comments for a notebook digging further into one of the topics.

Comments closed

Model Diagnostics for Statistics vs Machine Learning

Published 2025-05-02 by Kevin Feasel

Christian Lorentzen talks diagnostics:

In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.

As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications 2nd ed. (2021). This dataset contains monthy rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999.

Read on to learn more about this dataset and how the mindset differs if you’re thinking about inference versus prediction.

Comments closed

Breaking down the Limitations of R^2

Published 2025-05-02 by Kevin Feasel

M. Fatih Tüzen explains an important regression concept:

When building a statistical model, one of the first numbers analysts and data scientists often cite is the R², or coefficient of determination. It’s widely reported in research, academic theses, and industry reports — and yet, frequently misunderstood or misused.

Does a high R² mean your model is good? Is it enough to evaluate model performance? What about its adjusted or predictive counterparts?

Read on to learn the answers to each question. H/T R-Bloggers.

Comments closed

The Monty Hall Problem

Published 2025-04-16 by Kevin Feasel

I have a new video:

In this video, I explain the classic Monty Hall problem, based on the concept of the show Let’s Make a Deal. I explain the paradox behind the problem and demonstrate that it’s better to switch doors.

I’m not joking at all when I say it took me years of listening to explanations before it actually clicked. Some of it is my innate stubbornness, but I think this is a great example of a true paradox, where the intuitive answer is wrong and first-level reasoning also leads you astray.

Comments closed

Data Splitting and Cross-Validation in R

Published 2025-04-14 by Kevin Feasel

Nick Han has a pair of articles. First up is on data splitting and pre-processing:

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.

H/T R-Bloggers for that one.

The second involves using cross-validation via the caret package in R:

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.

H/T R-Bloggers for that as well.

Comments closed

Matrix Calculator and Operations Toolkit

Published 2025-04-10 by Kevin Feasel

Sebastiao Pereira hearkens back to linear algebra courses:

Is there a way to perform matrix operations or calculations directly within the database, eliminating the need for an external matrix calculator in SQL Server?

Click through for a large number of functions. Sebastiao went all-out in this one.

Comments closed

The Basics of Bayes’ Theorem

Published 2025-04-09 by Kevin Feasel

I have a new video:

In this video, I provide an introduction to Bayes’ theorem, explaining the key concepts and terms, as well as solving a totally realistic problem via Bayesian analysis.

My goal in this video was to explain a counter-intuitive phenomenon: how much a positive test or piece of information moves the needle depends primarily on how frequently the event normally happens and how frequently we generate false positives in tests. The less common the scenario, the less your positive test actually moves the needle.

Comments closed

Converting a CSV to Parquet with DuckDB and Polars in R

Published 2025-04-03 by Kevin Feasel

Michael Mayer makes a swap:

In this recent post, we have used Polars and DuckDB to convert a large CSV file to Parquet in steaming mode – and Python.

Different people have contacted me and asked: “and in R?”

Simple answer: We have DuckDB, and we have different Polars bindings. Here, we are using {polars} which is currently being overhauled into {neopandas}.

Click through for the comparison.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Data Science