Press "Enter" to skip to content

Category: R

Getting the Nth Row from a Data Frame

Steven Sanderson takes a slice:

Explanation: – We use nrow(my_df) to get the total number of rows in the data frame. – Then, we use indexing ([nrow(my_df), ]) to extract the last row.

Read on for a simple example of getting the last row using base R, dplyr, and data.table. Then, we kick it up a notch to get the second-to-last row, with the idea being that you could substitute “second-to-last” with any arbitrary number.

Comments closed

Specifying Follow-Up Times for Longitudinal Data in simstudy

Keith Goldfield updates the simstudy package:

A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are {0,1,2,…,n−1}{0,1,2,…,n−1}, where there are n total periods. However, when follow-up periods required more specificity, such as {0,90,180,365}{0,90,180,365} days from baseline, users had to manually add them. Originally, I had intended to incorporate this feature into the function, but unfortunately it slipped through the cracks. Thanks to the clear motivation provided by the researcher, I’ve implemented this enhancement. Users can now replace the default vector with their desired set of follow-up periods using the new argument periodVec. This addition is available in the development version of simstudy on GitHub.

Read on to see how it works. H/T R-Bloggers.

Comments closed

Estimating Chi-Square Parameters with R

Steven Sanderson performs a test:

In the world of statistics and data analysis, understanding and accurately estimating the parameters of probability distributions is crucial. One such distribution is the chi-square distribution, often encountered in various statistical analyses. In this blog post, we’ll dive into how we can estimate the degrees of freedom (“df”) and the non-centrality parameter (“ncp”) of a chi-square distribution using R programming language.

Read on to learn more about the process of estimation while I grumble something about Bayesian analysis being better.

Comments closed

Regular Expressions in R

Steven Sanderson now has two problems:

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1x2x3x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Click through for a worked-out example.

Comments closed

Selecting by Index in R

Steven Sanderson grabs some rows:

Imagine you want to analyze the fuel efficiency (miles per gallon) of a particular car. Here’s how to grab a single row by its index (row number):

Read on for that example, as well as examples covering two other cases: multiple rows based on index value, and ranges of indices.

Comments closed

Removing Multiple Rows from a DataFrame via Base R

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.

Comments closed

The Performance of Various Tidy Wrappers

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.tablearrowpolars and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example is a dialect of SQL, an entirely different language so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.

Comments closed

Removing Rows with Missing Data in R

Steven Sanderson shows us three ways:

Handling missing values is a crucial aspect of data preprocessing in R. Often, datasets contain missing values, which can adversely affect the analysis or modeling process. One common task is to remove rows containing missing values entirely. In this tutorial, we’ll explore different methods to accomplish this task in R, catering to scenarios where we want to remove rows with either some or all missing values.

Click through for three ways to do this.

Comments closed