Press "Enter" to skip to content

Category: Data Science

Decision Trees and Non-Tabular Data

Ivan Palomares Carrascosa explains that you can use more than standard structured data against decision trees:

Versatile, interpretable, and effective for a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades, applied to classification and regression tasks alike. They remain in wide use today, whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes the boundaries of their versatility even further: they can accommodate data in diverse formats, beyond just fully structured, tabular data. This article examines this facet of decision trees from a balanced theoretical and practical approach.

Click through for an example.
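The article's own example is behind the link; as a minimal sketch of the idea, assuming a toy corpus and default TF-IDF settings, unstructured text can be vectorized and then fed to an ordinary decision tree:

```python
# A minimal sketch: classifying short text snippets with a decision tree.
# The corpus and labels here are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

texts = [
    "refund my order immediately",
    "love the fast shipping",
    "package arrived broken",
    "great product, will buy again",
]
labels = ["complaint", "praise", "complaint", "praise"]

# TF-IDF turns the text into a numeric matrix the tree can split on.
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)
print(model.predict(["shipping was quick and painless"]))
```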


The Through-the-Door Problem in Credit Risk Modeling

Richard Vale takes us through a data challenge:

In credit risk modelling, you want to calculate the probability that a loan will default. Since different financial institutions gather different data and offer different products, there is no one-size-fits-all approach to doing this. Therefore, credit risk models are usually built using the institution’s own data. For example, if I’m building a credit risk model for XYZ Bank, I look at loans which XYZ Bank has previously granted, and try to estimate the probability that a future loan will default based on principal, tenor, the borrower’s credit rating, and so on.

For those who haven’t heard of the through-the-door problem before, this is a good moment to pause and think about what is wrong with this. Why does this process contain a huge pitfall?

Click through for the answer, as well as an example of the problem and one way to get around this. H/T R-Bloggers.
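For a sense of the mechanism involved, here is a toy simulation with invented numbers, not Richard Vale's example: a model trained only on loans the bank approved never observes the riskier applicants it rejected, even though it will be asked to score everyone who walks through the door.

```python
# Toy illustration of the through-the-door problem: we only observe
# defaults for approved loans, yet must score all future applicants.
# All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
credit_score = rng.normal(650, 80, n)
# True default probability falls as the score rises.
p_default = 1 / (1 + np.exp((credit_score - 600) / 40))
defaulted = rng.random(n) < p_default

approved = credit_score > 620  # the bank only lends above a cutoff

print(f"Default rate among approved loans: {defaulted[approved].mean():.1%}")
print(f"Default rate among all applicants: {defaulted.mean():.1%}")
# A model fit only on the approved subset never sees the riskier
# rejected population it will face "through the door".
```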


Spatial Cross-Validation in R

Jakub Nowosad wraps up a series:

This document provides an overview of two R packages, sperrorest and blockCV, that can be used for spatial cross-validation but are outside of standard machine learning frameworks like caret, tidymodels, or mlr3.

All of the examples below use the same dataset, which includes temperature measurements in Spain, a set of covariates, and the spatial coordinates of the measurement sites.

Click through for a pair of cross-validation packages, as well as a link to the rest of the series. H/T R-Bloggers.
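Both packages are R-specific; for readers outside R, a rough analogue of the spatial blocking idea, with synthetic coordinates and an assumed 20-unit grid, can be emulated in scikit-learn by grouping cross-validation folds on grid-cell membership:

```python
# A rough Python analogue of spatial block cross-validation:
# points are binned into a coarse grid, and folds are grouped by
# grid cell so nearby points never straddle train and test.
# Coordinates and grid size are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 500
x, y = rng.uniform(0, 100, n), rng.uniform(0, 100, n)
X = np.column_stack([x, y, rng.normal(size=n)])
target = 0.05 * x - 0.03 * y + rng.normal(scale=0.5, size=n)

# 20x20-unit blocks: each point's group is its grid cell.
blocks = (x // 20).astype(int) * 100 + (y // 20).astype(int)

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestRegressor(random_state=0),
                         X, target, groups=blocks, cv=cv)
print(scores.round(3))
```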


Handling Imbalanced Data in Python

Ivan Palomares Carrascosa gives three ways to deal with imbalanced data:

Here’s the catch: imbalanced data usually makes analysis more difficult. Machine learning models can easily become biased toward the majority class when the class distribution is remarkably unequal, in the most extreme case degenerating into an almost “dummy classifier” that assigns the same class to virtually everything.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Click through for those ways, including sample code.
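The article's three strategies sit behind the link; as a hedged sketch of two common moves with these libraries, on assumed toy data, one can oversample the minority class with pandas or reweight classes inside the scikit-learn estimator:

```python
# Two common ways to handle class imbalance, sketched on toy data:
# 1) random oversampling of the minority class with pandas,
# 2) class weighting built into the scikit-learn estimator.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "feature": range(20),
    "label": [0] * 17 + [1] * 3,  # heavily imbalanced toy labels
})

# 1) Oversample the minority class until both classes match in size.
minority = df[df.label == 1]
majority = df[df.label == 0]
oversampled = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])
print(oversampled.label.value_counts())

# 2) Or leave the data alone and reweight errors inside the model.
clf = LogisticRegression(class_weight="balanced")
clf.fit(df[["feature"]], df.label)
```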


Advanced Imputation Techniques via scikit-learn

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. Instances can have missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there are basic strategies to deal with instances or attributes containing missing values, like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute), these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.
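Whether these match the article's three picks is behind the link, but KNNImputer and IterativeImputer are two standard model-based imputers in scikit-learn; here is a minimal sketch on an invented frame:

```python
# Model-based imputation with scikit-learn, sketched on a toy frame.
# KNNImputer fills gaps from similar rows; IterativeImputer models
# each feature as a function of the others.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 51, 46],
    "income": [40_000, np.nan, 58_000, 90_000, 72_000],
})

print(pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                   columns=df.columns))
print(pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                   columns=df.columns))
```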


A Primer on Loss Functions

Kanwal Mehreen compares loss functions:

I must say, with the ongoing hype around machine learning, a lot of people jump straight to the application side without really understanding how things work behind the scenes. What’s our objective with any machine learning model, anyway? You might say, “To make accurate predictions.” Fair enough.

But how do you actually tell your model, “You’re close” or “You’re way off”? How does it know it made a mistake — and by how much?

That’s where loss functions come in.

Read on to learn what loss functions are, how they work, and when you might want to choose each.
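As a quick illustration of the concept (plain NumPy definitions, not the article's code), here are three common loss functions written out by hand:

```python
# Three common loss functions, written out in NumPy.
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: punishes large residuals quadratically."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: more robust to outliers than MSE."""
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for binary classification; p_pred are probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
print(mse(y, p), mae(y, p), binary_cross_entropy(y, p))
```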


Extending caret for Spatial Machine Learning

Jan Linnenbrink looks at spatial data:

This document shows the application of caret for spatial modelling, using the example of predicting air temperature in Spain. We use measurements of air temperature available only at specific locations in Spain to create a spatially continuous map of air temperature. To that end, machine-learning models are trained to learn the relationship between spatially continuous predictors and air temperature.

When using machine-learning methods with spatial data, we need to take care of issues such as spatial autocorrelation, as well as extrapolation when predicting in regions far away from the training data. Several methods have been developed to deal with these issues. In this document, we show how to combine the machine-learning workflow of caret with packages designed for machine learning with spatial data. Specifically, we use blockCV::cv_spatial() and CAST::knndm() for spatial cross-validation, and CAST::aoa() to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.

Click through to see how it all works. H/T R-Bloggers.
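The workflow above is R-based; as a deliberately crude Python stand-in for the extrapolation-masking idea (a simple range check, far less sophisticated than CAST::aoa()), one can flag prediction points whose predictors fall outside the training data's coverage:

```python
# A crude stand-in for masking extrapolation: flag prediction points
# whose predictor values fall outside the range seen in training.
# (CAST::aoa() in R is far more sophisticated; this is just the idea.)
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform([0, 10], [5, 30], size=(200, 2))   # e.g. elevation, lat
X_new = rng.uniform([-2, 5], [8, 40], size=(10, 2))      # wider than training

lo, hi = X_train.min(axis=0), X_train.max(axis=0)
inside = ((X_new >= lo) & (X_new <= hi)).all(axis=1)

print("Points safe to predict on:", inside.sum(), "of", len(X_new))
# Predictions at the remaining points would be extrapolation and
# should be masked or flagged rather than trusted.
```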


The Dual Perils of Overfitting and Data Leakage

John Mount shares notes on a theme:

One of the bigger risks of iterative statistical or machine learning fitting procedures is over-fit or the dreaded data leak.

Over-fit is when a model performs better on training data than on future data; some degree of over-fit is expected. A data leak is when the model learns things about the evaluation set that it would not know about the future data the model will be applied to. This can drive models that look great on training and (supposedly) held-out data, but don’t work in practice.

Click through for the rest of the story, and be sure to check out the comments for a notebook digging further into one of the topics.
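A classic demonstration of the leak (not Mount's notebook): selecting features on the full dataset before cross-validating makes pure noise look predictive, while doing the selection inside each fold does not.

```python
# A classic data-leak demo: feature selection performed on the FULL
# dataset before cross-validation leaks test information into training.
# With pure noise features, honest accuracy should hover near 50%.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))        # pure noise
y = rng.integers(0, 2, size=100)        # random labels

# Leaky: pick the 20 "best" columns using ALL rows, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection happens inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # optimistically high
print(f"honest CV accuracy: {honest:.2f}")  # near chance
```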


Model Diagnostics for Statistics vs Machine Learning

Christian Lorentzen talks diagnostics:

In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.

As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications 2nd ed. (2021). This dataset contains monthly rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999.

Read on to learn more about this dataset and how the mindset differs if you’re thinking about inference versus prediction.
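As a small sketch of the two mindsets, on synthetic data rather than the Munich rent index, the same linear model can be read through its coefficient table for inference or judged purely on held-out error for prediction:

```python
# The same linear model, read two ways: coefficients and p-values
# for inference, held-out error for prediction. Synthetic data here,
# not the Munich rent index used in the post.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
area = rng.uniform(20, 120, 300)
rent = 5 + 9.5 * area + rng.normal(scale=80, size=300)

# Inference: how does rent relate to area, and how sure are we?
print(sm.OLS(rent, sm.add_constant(area)).fit().summary().tables[1])

# Prediction: how far off are we on apartments we haven't seen?
X = area.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, rent, random_state=0)
fit = sm.OLS(y_tr, sm.add_constant(X_tr)).fit()
pred = fit.predict(sm.add_constant(X_te))
print("held-out RMSE:", mean_squared_error(y_te, pred) ** 0.5)
```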


Breaking down the Limitations of R²

M. Fatih Tüzen explains an important regression concept:

When building a statistical model, one of the first numbers analysts and data scientists often cite is the R², or coefficient of determination. It’s widely reported in research, academic theses, and industry reports — and yet, frequently misunderstood or misused.

Does a high R² mean your model is good? Is it enough to evaluate model performance? What about its adjusted or predictive counterparts?

Read on to learn the answers to each question. H/T R-Bloggers.
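One of those limitations is easy to demonstrate numerically (synthetic data, not from the post): plain R² never decreases as junk predictors are added, while adjusted R² can fall.

```python
# R^2 never decreases as predictors are added, even pure noise ones;
# adjusted R^2 penalizes the extra parameters. Synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=(n, 1))
y = 2 * x[:, 0] + rng.normal(size=n)

for extra in [0, 10, 30]:
    X = np.hstack([x, rng.normal(size=(n, extra))])  # add noise columns
    r2 = LinearRegression().fit(X, y).score(X, y)
    p = X.shape[1]
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(f"{extra:2d} junk predictors: R^2={r2:.3f}, adjusted R^2={adj:.3f}")
```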
