Press "Enter" to skip to content

Category: R

Random Walks in R with RandomWalker

Steven Sanderson is going for a walk (not the after-dinner kind):

Welcome to the world of ‘RandomWalker’, an innovative R package designed to simplify the creation of various types of random walks. Developed by myself and my co-author, Antti Rask, this package is in its experimental phase but promises to be a powerful tool for statisticians, data scientists, and financial analysts alike. With a focus on Tidyverse compatibility, ‘RandomWalker’ aims to integrate seamlessly into your data analysis workflows, offering both automatic and customizable random walk generation.

Read on to learn more about the package, including why you might want to use it and the functionality you can get out of it.

Comments closed

Printing a Table in R via table()

Steven Sanderson builds a table:

Tables are an essential part of data analysis, serving as a powerful tool to summarize and interpret data. In R, the table() function is a versatile tool for creating frequency and contingency tables. This guide will walk you through the basics and some advanced applications of the table() function, helping you understand its usage with clear examples.

Click through for more information and several examples.

Comments closed

The Importance of Versioning Data

John Mount demonstrates an important concept:

Our business goal is to build a model relating attendance to popcorn sales, which we will apply to future data in order to predict future popcorn sales. This allows us to plan staffing and purchasing, and also to predict snack bar revenue.

In the above example data, all dates in August of 2024 are “in the past” (available as training and test/validation data) and all dates in September of 2024 are “in the future” (dates we want to make predictions for). The movie attendance service we are subscribing to supplies

  • past schedules
  • past (recorded) attendance
  • future schedules, and
  • (estimated) future attendance.

John’s example scenario covers the problem of future estimations interfering with model quality. Another important scenario is when the past changes. As one example, digital marketing providers (think Google, Bing, Amazon, etc.) will provide you impression and click data pretty quickly, and each day they close the books on a prior day’s data at some normal time. For some of these providers, that prior day’s data is yesterday’s data—on Tuesday, provider X closes the books on Monday’s data and promises that it won’t change after that. But for other providers, they might change data over the course of the next 10 days. This means that the data you’re using for model training might change from under you, and you might never know if you don’t keep track of the actual data you used for training at the time of training.

Comments closed

Working with lapply() in R

Steven Sanderson applies a function:

R is a powerful programming language primarily used for statistical computing and data analysis. Among its many features, the lapply() function stands out as a versatile tool for simplifying code and reducing redundancy. Whether you’re working with lists, vectors, or data frames, understanding how to use lapply() effectively can greatly enhance your programming efficiency. For beginners, mastering lapply() is a crucial step in becoming proficient in R.

Read on to see how lapply() works.

Comments closed

Sampling without Replacement and Unequal Probabilities

Peter Ellis finds interesting results with sampling in R:

A week ago I was surprised to read on Thomas Lumley’s Biased and Inefficient blog that when using R’s sample() function without replacement and with unequal probabilities of individual units being sampled:

“What R currently has is sequential sampling: if you give it a set of priorities w it will sample an element with probability proportional to w from the population, remove it from the population, then sample with probability proportional to w from the remaining elements, and so on. This is useful, but a lot of people don’t realise that the probability of element i being sampled is not proportional to w_i”

Read on for a demonstration. H/T R-Bloggers.

Comments closed

Explaining a Causal Forest

Michael Mayer wants to suss out the effects of inputs into a causal forest model:

We use a causal forest [1] to model the treatment effect in a randomized controlled clinical trial. Then, we explain this black-box model with usual explainability tools. These will reveal segments where the treatment works better or worse, just like a forest plot, but multivariately.

Read on for the example, as well as several mechanisms you can use to gauge feature relevance.

Comments closed

Random Forest Missing Data Imputation using missRanger

Michael Mayer handles missing data:

{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.

This looks like an interesting package. At first, I thought it was a way of generating predictions outside the boundaries of training data and had concerns—a classic point (limitation?) of random forest as an algorithm is that it will not even try to predict values outside the range of what it sees in training data, so if the largest label is 10 and the smallest is 0, you won’t see a prediction of 11 or 50, no matter how you scale the inputs.

Instead of doing that, missRanger looks like it’s filling in missing data using a clever approach. That’s quite useful for dealing with incomplete data, a really common problem whose good solutions tend to be complex enough that people typically ignore them in favor of simple but less useful solutions like dropping rows altogether.

Comments closed

Comparing grep() and grepl() in R

Steven Sanderson compares two functions:

Both grep() and grepl() are functions in R that help us search for patterns in text. Think of them as detectives looking for clues in a big pile of words!

  • grep(): This function is like a pointer. It tells you where it found the pattern you’re looking for.
  • grepl(): This one is more like a yes/no checker. It tells you if the pattern exists or not.

Read on for examples of each.

Comments closed

Searching for Multiple Patterns in R with grepl

Steven Sanderson looks for the pattern:

Hello, fellow useRs! Today, we’re going to expand on previous uses of the grepl() function where we looked for a single pattern and move onto to a search for multiple patterns within strings. Whether you’re cleaning data, conducting text analysis, grepl can be your go-to tool. Let’s break down the syntax, offer a practical example, and guide you on a path to proficiency.

Read on for all of that.

Comments closed

Loops in R

Ben Johnston spins in circles:

Welcome back to my R for SEO series. We’re in the home stretch now, with part seven. Today, we’re going to be looking at different ways that we can run functions or commands over a series of elements using the various kinds of loops that exist in R.

If you’ve followed along so far, or you’ve tried some experimentation of your own, you’ve probably encountered loops and applys along the way. I know early on in my R journey, it very much seemed like pot luck as to which apply I should use, or whether a loop was easier, so hopefully today’s piece will start to clear that up for you a little.

I know that most programming courses cover these elements earlier, but for me, it really didn’t click until I’d learned more about the other areas we’ve covered in this series, so that’s why I’ve placed it here.

Read on for examples of For loops and While loops, as well as breaking conditions.

Ben also talks about loops versus using the apply() series of functions (or equivalent map() functions in the purrr library). I tend to lean heavily on using the mapping function approach when there are no side effects, and use for loops when there are. H/T R-Bloggers.

Comments closed