Press "Enter" to skip to content

Category: R

Using strsplit() with Multiple Delimiters in R

Steven Sanderson shows off some more complex string splitting scenarios in R:

In data preprocessing and text manipulation tasks, the strsplit() function in R is incredibly useful for splitting strings based on specific delimiters. However, what if you need to split a string using multiple delimiters? This is where strsplit() can really shine by allowing you to specify a regular expression that defines these delimiters. In this blog post, we’ll dive into how you can use strsplit() effectively with multiple delimiters to parse strings in your data.

Read on for two examples of complex scenarios.

Comments closed

Evenly Spacing Month Charts in ggplot2

Jameson Marriott fixes a spacing issue:

I recently noticed that ggplot2 spaces date axes literally even when grouped by month. I’ve been using ggplot2 extensively for years and I don’t remember noticing before, so this is not really a big deal, but now that I know it bugs me a lot. Take a look below.

I don’t think I had noticed this before either, though now that Jameson has pointed it out, it certainly is annoying. H/T R-Bloggers.

Comments closed

Dropping Data Frame Columns in R

Steven Sanderson performs feature selection:

As an R programmer, one of the fundamental tasks you’ll encounter is manipulating data frames. Whether you’re cleaning messy data or preparing it for analysis, knowing how to drop unnecessary columns is a valuable skill. In this guide, we’ll walk through the process of dropping columns from data frames in R, using simple examples to demystify the process.

Read on for three ways of doing this.

Comments closed

Selecting the Top N Values by Group in R

Steven Sanderson searches for subsets:

In data analysis, there often arises a need to extract the top N values within each group of a dataset. Whether you’re dealing with sales data, survey responses, or any other type of grouped data, identifying the top performers or outliers within each group can provide valuable insights. In this tutorial, we’ll explore how to accomplish this task using three popular R packages: dplyr, data.table, and base R. By the end of this guide, you’ll have a solid understanding of various approaches to selecting top N values by group in R.

Read on for the three examples.

Comments closed

Updates in R 4.4.0

Russ Hyde shares items of interest:

R 4.4.0 (“Puppy Cup”) was released on the 24th April 2024 and it is a beauty. In time-honoured tradition, here we summarise some of the changes that caught our eyes. R 4.4.0 introduces some cool features (one of which is experimental) and makes one of our favourite {rlang} operators available in base R. There are a few things you might need to be aware of regarding handling NULL and complex values.

The full changelog can be found at the r-release ‘NEWS’ page and if you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.

Read on for a big note on tail-call recursion, an operator for coalescing, and a few more neat features.

Comments closed

Getting the Nth Row from a Data Frame

Steven Sanderson takes a slice:

Explanation: – We use nrow(my_df) to get the total number of rows in the data frame. – Then, we use indexing ([nrow(my_df), ]) to extract the last row.

Read on for a simple example of getting the last row using base R, dplyr, and data.table. Then, we kick it up a notch to get the second-to-last row, with the idea being that you could substitute “second-to-last” with any arbitrary number.

Comments closed

Specifying Follow-Up Times for Longitudinal Data in simstudy

Keith Goldfield updates the simstudy package:

A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are {0,1,2,…,n−1}{0,1,2,…,n−1}, where there are n total periods. However, when follow-up periods required more specificity, such as {0,90,180,365}{0,90,180,365} days from baseline, users had to manually add them. Originally, I had intended to incorporate this feature into the function, but unfortunately it slipped through the cracks. Thanks to the clear motivation provided by the researcher, I’ve implemented this enhancement. Users can now replace the default vector with their desired set of follow-up periods using the new argument periodVec. This addition is available in the development version of simstudy on GitHub.

Read on to see how it works. H/T R-Bloggers.

Comments closed

Estimating Chi-Square Parameters with R

Steven Sanderson performs a test:

In the world of statistics and data analysis, understanding and accurately estimating the parameters of probability distributions is crucial. One such distribution is the chi-square distribution, often encountered in various statistical analyses. In this blog post, we’ll dive into how we can estimate the degrees of freedom (“df”) and the non-centrality parameter (“ncp”) of a chi-square distribution using R programming language.

Read on to learn more about the process of estimation while I grumble something about Bayesian analysis being better.

Comments closed

Regular Expressions in R

Steven Sanderson now has two problems:

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1x2x3x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Click through for a worked-out example.

Comments closed