Repeated Cross-Validation in R

Ludvig Olsen walks us through a couple of nice R packages:

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.

Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers

Related Posts

Defining TF-IDF

Bruno Stecanella explains the concept behind TF-IDF: TF-IDF was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank […]

Read More

Defining Tidy Data

John Mount shares thoughts about the concept of tidy data: A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no the above table is not “tidy” under that paper’s definition. The the winner’s date of birth is a fact about the winner […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

May 2019
MTWTFSS
« Apr  
 12345
6789101112
13141516171819
20212223242526
2728293031