R – Page 7 – Curated SQL

Looping through Column Names in R

Published 2024-10-18 by Kevin Feasel

Looping through column names in R is a crucial technique for data manipulation, especially for beginners. This article will guide you through various methods to loop through column names in R, providing practical examples and insights to enhance your data analysis skills.

Read on for examples with for loops, the dynamic duo of lapply() and sapply(), and the map() function in the purrr library.

Comments closed

Adding a Suffix to a Column Name in R

Published 2024-10-15 by Kevin Feasel

Steven Sanderson renames a column or three:

When working with data frames in R, you might find yourself needing to modify column names to include additional information, such as a suffix. This can be particularly useful when merging datasets or when you want to ensure that column names are unique and descriptive.

Read on to see how.

Comments closed

Working with List Columns in R

Published 2024-10-14 by Kevin Feasel

Sebastian Sauer makes a list and checks it twice:

In this post, I want to show you how to work with list columns in R. List columns are a powerful feature of the tidyverse that allow you to store multiple objects in a single column of a data frame. This can be useful when you have a list of objects that you want to keep together, such as a list of data frames or a list of models.

Click through for a demo.

Comments closed

Combining Data Frames with Differing Columns in R

Published 2024-10-11 by Kevin Feasel

Steven Sanderson does a bit of merging:

Combining data frames is a fundamental task in data analysis, especially when dealing with datasets that have different structures. In R, there are several ways to achieve this, using base R functions, the dplyr package, and the data.table package. This guide will walk you through each method, providing examples and explanations suitable for beginner R programmers. This article will explore three primary methods in R: base R functions, dplyr, and data.table. Each method has its advantages, and understanding them will enhance your data manipulation skills.

There are quite a few examples here, depending on whether you intend to join the datasets or perform a set operation such as union or intersect.

Comments closed

Smoothing Functions in R

Published 2024-10-11 by Kevin Feasel

Ivan Svetunkov puts on the forecasting hat:

I have been asked recently by a colleague of mine how to extract the variance from a model estimated using adam() function from the smooth package in R. The problem was that that person started reading the source code of the forecast.adam() and got lost between the lines (this happens to me as well sometimes). Well, there is an easier solution, and in this post I want to summarise several methods that I have implemented in the smooth package for forecasting functions. In this post I will focus on the adam() function, although all of them work for es() and msarima() as well, and some of them work for other functions (at least as for now, for smooth v4.1.0). Also, some of them are mentioned in the Cheat sheet for adam() function of my monograph (available online).

Read on to learn more. H/T R-Bloggers.

Comments closed

Reading Parquet Files in R with nanoparquet

Published 2024-10-10 by Kevin Feasel

Stephen Turner reads some data:

In these slides I also learned about the nanoparquet package — a zero dependency package for reading and writing parquet files in R. Besides all the benefits noted above, parquet is much faster to read and write. And, as opposed to saving as .rds, parquet can easily be passed back and forth between R, Python, and other frameworks.

Let’s take a look at how reading and writing parquet files compares with CSV, either with base R or readr.

Stephen shows one of the best-case scenarios for Parquet: lots of data (100 million rows), relatively few columns, no long strings, etc. That leads to a massive improvement over using CSVs, even if you ignore the metadata and formatting benefits. I wouldn’t expect the benefits to be nearly as significant with wide text columns and very little value overlap, but that’s also pretty uncommon for the type of dataset we’re analyzing in R.

Comments closed

Grouping Rows in R

Published 2024-10-08 by Kevin Feasel

Steven Sanderson needs a GROUP BY clause:

Combining rows with the same column values is a fundamental task in data analysis and manipulation, especially when handling large datasets. This guide is tailored for beginner R programmers looking to efficiently merge rows using Base R, the dplyr package, and the data.table package. By the end of this guide, you will be able to seamlessly aggregate data in R, enhancing your data analysis capabilities.

Click through for several code examples.

Comments closed

Splitting Data into Equally-Sized Groups in R

Published 2024-10-04 by Kevin Feasel

Steven Sanderson splits out some data:

As a beginner R programmer, you’ll often encounter situations where you need to divide your data into equal-sized groups. This process is crucial for various data analysis tasks, including cross-validation, creating balanced datasets, and performing group-wise operations. In this comprehensive guide, we’ll explore multiple methods to split data into equal-sized groups using different R packages and approaches.

Click through for methods and examples.

Comments closed

Creating Horizontal Box Plots in R

Published 2024-09-27 by Kevin Feasel

Steven Sanderson tips the box plot over:

A boxplot, also known as a whisker plot, displays the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. It highlights the data’s central tendency and variability, making it easier to identify outliers.

Read on for examples in base R and ggplot2.

Comments closed

Simple Outlier Detection and Removal in R

Published 2024-09-25 by Kevin Feasel

Steven Sanderson looks for oddities:

Outliers can significantly skew your data analysis results, leading to inaccurate conclusions. For R programmers, effectively identifying and removing outliers is crucial for maintaining data integrity. This guide will walk you through various methods to handle outliers in R, focusing on multiple columns, using a synthetic dataset for demonstration.

The techniques Steven uses are perfectly reasonable (though I like to use MAD from the median rather than standard deviations from the mean because MAD from the median doesn’t suffer from the sorts of endogeneity problems standard deviation does in a dynamic process). My primary warning would be to keep outliers in a dataset unless you know why you’re removing them. If you know the values were impossible or wrong—for example, a person who works 500 hours a week—that’s one thing. But sometimes, you get exceptional values out of an ordinary process, and those values are just as real as any other. I might have had a sequence in which I flipped a fair coin and it landed on heads 10 times in a row. It’s statistically very uncommon, but that doesn’t mean you can ignore it as a possibility or pretend it didn’t happen.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: R