Press "Enter" to skip to content

Category: R

Specifying Follow-Up Times for Longitudinal Data in simstudy

Keith Goldfield updates the simstudy package:

A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are {0,1,2,…,n−1}{0,1,2,…,n−1}, where there are n total periods. However, when follow-up periods required more specificity, such as {0,90,180,365}{0,90,180,365} days from baseline, users had to manually add them. Originally, I had intended to incorporate this feature into the function, but unfortunately it slipped through the cracks. Thanks to the clear motivation provided by the researcher, I’ve implemented this enhancement. Users can now replace the default vector with their desired set of follow-up periods using the new argument periodVec. This addition is available in the development version of simstudy on GitHub.

Read on to see how it works. H/T R-Bloggers.

Leave a Comment

Estimating Chi-Square Parameters with R

Steven Sanderson performs a test:

In the world of statistics and data analysis, understanding and accurately estimating the parameters of probability distributions is crucial. One such distribution is the chi-square distribution, often encountered in various statistical analyses. In this blog post, we’ll dive into how we can estimate the degrees of freedom (“df”) and the non-centrality parameter (“ncp”) of a chi-square distribution using R programming language.

Read on to learn more about the process of estimation while I grumble something about Bayesian analysis being better.

Leave a Comment

Regular Expressions in R

Steven Sanderson now has two problems:

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1x2x3x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Click through for a worked-out example.

Leave a Comment

The Performance of Various Tidy Wrappers

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.tablearrowpolars and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example is a dialect of SQL, an entirely different language so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.

Leave a Comment

Removing Multiple Rows from a DataFrame via Base R

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.

Leave a Comment

Removing Rows with Missing Data in R

Steven Sanderson shows us three ways:

Handling missing values is a crucial aspect of data preprocessing in R. Often, datasets contain missing values, which can adversely affect the analysis or modeling process. One common task is to remove rows containing missing values entirely. In this tutorial, we’ll explore different methods to accomplish this task in R, catering to scenarios where we want to remove rows with either some or all missing values.

Click through for three ways to do this.

Leave a Comment

Multidimensional Scaling in R

Steven Sanderson is from the 5th dimension:

Visualizing similarities between data points can be tricky, especially when dealing with many features. This is where multidimensional scaling (MDS) comes in handy. It allows us to explore these relationships in a lower-dimensional space, typically 2D or 3D for easier interpretation. In R, the cmdscale() function from base R and is a great tool for performing classical MDS.

Click through to see how this works. In case you’re curious, cmdscale() is an example of principal coordinates analysis. If you’re familiar with principal components analysis, that’s a different form of multidimensional scaling.

Leave a Comment