R – Page 7 – Curated SQL

Grouping Rows in R

Published 2024-10-08 by Kevin Feasel

Steven Sanderson needs a GROUP BY clause:

Combining rows with the same column values is a fundamental task in data analysis and manipulation, especially when handling large datasets. This guide is tailored for beginner R programmers looking to efficiently merge rows using Base R, the dplyr package, and the data.table package. By the end of this guide, you will be able to seamlessly aggregate data in R, enhancing your data analysis capabilities.

Click through for several code examples.

Comments closed

Splitting Data into Equally-Sized Groups in R

Published 2024-10-04 by Kevin Feasel

Steven Sanderson splits out some data:

As a beginner R programmer, you’ll often encounter situations where you need to divide your data into equal-sized groups. This process is crucial for various data analysis tasks, including cross-validation, creating balanced datasets, and performing group-wise operations. In this comprehensive guide, we’ll explore multiple methods to split data into equal-sized groups using different R packages and approaches.

Click through for methods and examples.

Comments closed

Creating Horizontal Box Plots in R

Published 2024-09-27 by Kevin Feasel

Steven Sanderson tips the box plot over:

A boxplot, also known as a whisker plot, displays the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. It highlights the data’s central tendency and variability, making it easier to identify outliers.

Read on for examples in base R and ggplot2.

Comments closed

Simple Outlier Detection and Removal in R

Published 2024-09-25 by Kevin Feasel

Steven Sanderson looks for oddities:

Outliers can significantly skew your data analysis results, leading to inaccurate conclusions. For R programmers, effectively identifying and removing outliers is crucial for maintaining data integrity. This guide will walk you through various methods to handle outliers in R, focusing on multiple columns, using a synthetic dataset for demonstration.

The techniques Steven uses are perfectly reasonable (though I like to use MAD from the median rather than standard deviations from the mean because MAD from the median doesn’t suffer from the sorts of endogeneity problems standard deviation does in a dynamic process). My primary warning would be to keep outliers in a dataset unless you know why you’re removing them. If you know the values were impossible or wrong—for example, a person who works 500 hours a week—that’s one thing. But sometimes, you get exceptional values out of an ordinary process, and those values are just as real as any other. I might have had a sequence in which I flipped a fair coin and it landed on heads 10 times in a row. It’s statistically very uncommon, but that doesn’t mean you can ignore it as a possibility or pretend it didn’t happen.

Comments closed

Random Walks in R with RandomWalker

Published 2024-09-17 by Kevin Feasel

Steven Sanderson is going for a walk (not the after-dinner kind):

Welcome to the world of ‘RandomWalker’, an innovative R package designed to simplify the creation of various types of random walks. Developed by myself and my co-author, Antti Rask, this package is in its experimental phase but promises to be a powerful tool for statisticians, data scientists, and financial analysts alike. With a focus on Tidyverse compatibility, ‘RandomWalker’ aims to integrate seamlessly into your data analysis workflows, offering both automatic and customizable random walk generation.

Read on to learn more about the package, including why you might want to use it and the functionality you can get out of it.

Comments closed

Printing a Table in R via table()

Published 2024-09-13 by Kevin Feasel

Steven Sanderson builds a table:

Tables are an essential part of data analysis, serving as a powerful tool to summarize and interpret data. In R, the table() function is a versatile tool for creating frequency and contingency tables. This guide will walk you through the basics and some advanced applications of the table() function, helping you understand its usage with clear examples.

Click through for more information and several examples.

Comments closed

The Importance of Versioning Data

Published 2024-09-12 by Kevin Feasel

John Mount demonstrates an important concept:

Our business goal is to build a model relating attendance to popcorn sales, which we will apply to future data in order to predict future popcorn sales. This allows us to plan staffing and purchasing, and also to predict snack bar revenue.

In the above example data, all dates in August of 2024 are “in the past” (available as training and test/validation data) and all dates in September of 2024 are “in the future” (dates we want to make predictions for). The movie attendance service we are subscribing to supplies

past schedules

past (recorded) attendance

future schedules, and

(estimated) future attendance.

John’s example scenario covers the problem of future estimations interfering with model quality. Another important scenario is when the past changes. As one example, digital marketing providers (think Google, Bing, Amazon, etc.) will provide you impression and click data pretty quickly, and each day they close the books on a prior day’s data at some normal time. For some of these providers, that prior day’s data is yesterday’s data—on Tuesday, provider X closes the books on Monday’s data and promises that it won’t change after that. But for other providers, they might change data over the course of the next 10 days. This means that the data you’re using for model training might change from under you, and you might never know if you don’t keep track of the actual data you used for training at the time of training.

Comments closed

Working with lapply() in R

Published 2024-09-12 by Kevin Feasel

Steven Sanderson applies a function:

R is a powerful programming language primarily used for statistical computing and data analysis. Among its many features, the lapply() function stands out as a versatile tool for simplifying code and reducing redundancy. Whether you’re working with lists, vectors, or data frames, understanding how to use lapply() effectively can greatly enhance your programming efficiency. For beginners, mastering lapply() is a crucial step in becoming proficient in R.

Read on to see how lapply() works.

Comments closed

Sampling without Replacement and Unequal Probabilities

Published 2024-09-04 by Kevin Feasel

Peter Ellis finds interesting results with sampling in R:

A week ago I was surprised to read on Thomas Lumley’s Biased and Inefficient blog that when using R’s sample() function without replacement and with unequal probabilities of individual units being sampled:

“What R currently has is sequential sampling: if you give it a set of priorities w it will sample an element with probability proportional to w from the population, remove it from the population, then sample with probability proportional to w from the remaining elements, and so on. This is useful, but a lot of people don’t realise that the probability of element i being sampled is not proportional to w_i”

Read on for a demonstration. H/T R-Bloggers.

Comments closed

Explaining a Causal Forest

Published 2024-09-03 by Kevin Feasel

Michael Mayer wants to suss out the effects of inputs into a causal forest model:

We use a causal forest [1] to model the treatment effect in a randomized controlled clinical trial. Then, we explain this black-box model with usual explainability tools. These will reveal segments where the treatment works better or worse, just like a forest plot, but multivariately.

Read on for the example, as well as several mechanisms you can use to gauge feature relevance.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: R