Press "Enter" to skip to content

Category: R

Extending caret for Spatial Machine Learning

Jan Linnenbrink looks at spatial data:

This document shows the application of caret for spatial modelling at the example of predicting air temperature in Spain. Hereby, we use measurements of air temperature available only at specific locations in Spain to create a spatially continuous map of air temperature. Therefore, machine-learning models are trained to learn the relationship between spatially continuous predictors and air temperature.

When using machine-learning methods with spatial data, we need to take care of, e.g., spatial autocorrelation, as well as extrapolation when predicting to regions that are far away from the training data. To deal with these issues, several methods have been developed. In this document, we will show how to combine the machine-learning workflow of caret with packages designed to deal with machine-learning with spatial data. Hereby, we use blockCV::cv_spatial() and CAST::knndm() for spatial cross-validation, and CAST::aoa() to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.

Click through to see how it all works. H/T R-Bloggers.

Comments closed

Using Multiple Scales with ggplot2 and ggnewscale

Zhenguo Zhang resets the scale:

In one ggplot figure, normally you can only use one scale for each aesthetic mapping. For example, if you use scale_color_manual() to set the color scale for a layer, you cannot use another scale_color_manual() for another layer, or set the color scale more then once in the function aes(). However, you can use the new_scale_color() function from the ggnewscale package to add a new scale for the same aesthetic mapping in different layers.

In this post, I will showcase how to use the new_scale_color() function to add two different color scales in a ggplot figure. The first scale will be for a discrete variable (e.g., number of cylinders), and the second scale will be for a continuous variable (e.g., density level).

Click through for the code and a demonstration.

Comments closed

Debugging R Code in Visual Studio Code

Yohann Mansiaux steps through the code:

We are going to see how to use these functions in VSCode, as well as introducing “breakpoints”. Breakpoints are markers that you can set in your code to pause execution at a specific line. This allows you to inspect the state of your code at that point and step through it line by line. They are very close to the browser() function. They can also be used in RStudio IDE, but I have to admit that I never used them.

Read on to see how VSCode fills the need when it comes to debugging code. H/T R-Bloggers.

Comments closed

Model Diagnostics for Statistics vs Machine Learning

Christian Lorentzen talks diagnostics:

In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.

As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications 2nd ed. (2021). This dataset contains monthy rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999.

Read on to learn more about this dataset and how the mindset differs if you’re thinking about inference versus prediction.

Comments closed

Breaking down the Limitations of R^2

M. Fatih Tüzen explains an important regression concept:

When building a statistical model, one of the first numbers analysts and data scientists often cite is the , or coefficient of determination. It’s widely reported in research, academic theses, and industry reports — and yet, frequently misunderstood or misused.

Does a high R² mean your model is good? Is it enough to evaluate model performance? What about its adjusted or predictive counterparts?

Read on to learn the answers to each question. H/T R-Bloggers.

Comments closed

Creating Error Bars in ggplot2

Zhenguo Zhang draws a chart:

Sometimes you may want to create a plot with the following features:

  • a point to indicate the mean of a group
  • error bars to indicate the standard deviation of the group
  • and each group may have subgroups, which are represented by different colors.

In this post, I will show you how to create such a plot using the ggplot2 package in R.

Read on for the demonstration, as well as fixing a common problem of overlapping data points. H/T R-Bloggers.

Comments closed

Function Generators versus Partial Application in R

Jonathan Carroll digs in:

The blog post (www.tidyverse.org) describing the latest updates to the tidyverse {scales} package neatly demonstrates the usage of the new functionality, but because the examples are written outside of actual plotting code, one feature stuck out to me in particular…

label_glue("The {x} penguin")(c("Gentoo", "Chinstrap", "Adelie"))
# The Gentoo penguin
# The Chinstrap penguin
# The Adelie penguin

Read on for a dive into what makes the actual invocation interesting. H/T R-Bloggers.

Comments closed

choroplethr 4.0.0 Now in CRAN

Ari Lamstein has an announcement:

With this version, I have transferred the maintenance of choroplethr to Zhaochen He, an economics professor at Christopher Newport University. Zhao addressed the issues that led to choroplethr being archived from CRAN in February. Please join me in thanking Zhao for his contribution!

Click through for the updates, as well as what Ari views as the current challenges for the project as he hands the project over Zhaochen He. H/T R-Bloggers.

Comments closed

Data Splitting and Cross-Validation in R

Nick Han has a pair of articles. First up is on data splitting and pre-processing:

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.

H/T R-Bloggers for that one.

The second involves using cross-validation via the caret package in R:

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.

H/T R-Bloggers for that as well.

Comments closed