
Category: R

R’s Global Regular Expression Function

Steven Sanderson has me wondering who Greg is and why he gets an expression of his own:

If you’ve ever worked with text data in R, you know how important it is to have powerful tools for pattern matching. One such tool is the gregexpr() function. This function is incredibly useful when you need to find all occurrences of a pattern within a string. Today, we’ll go into how gregexpr() works, explore its syntax, and go through several examples to make things clear.

Read on to learn more about the global regular expression function and how it works.
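As a quick illustration (my own sketch, not taken from Steven's post), here is roughly how gregexpr() pairs with regmatches() in base R. The sample sentence and variable names are made up for the example.

```r
# Hypothetical sample text for illustration only.
text <- "The cat sat on the mat with another cat"

# gregexpr() returns a list with one element per input string;
# each element holds the start position of every match.
matches <- gregexpr("cat", text)

# Start positions of each match in the first (and only) string.
starts <- as.integer(matches[[1]])

# regmatches() extracts the matched substrings themselves.
found <- regmatches(text, matches)[[1]]

print(starts)
print(found)
```

The "global" part is what separates this from regexpr(), which stops at the first match.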


Counting Words in a String in R

Steven Sanderson counts the ways:

Counting words in a string is a common task in data manipulation and text analysis. Whether you’re parsing tweets, analyzing survey responses, or processing any textual data, knowing how to count words is crucial. In this post, we’ll explore three ways to achieve this in R: using base R’s strsplit(), the stringr package, and the stringi package. We’ll provide clear examples and explanations to help you get started.

I, of course, would commission a 128-node Hadoop cluster and write a few dozen pages of Java code to get the answer.
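For those without a spare cluster, a minimal sketch of the three approaches might look like the following. The sample sentence is an assumption of mine, and the stringr and stringi calls only run if those packages happen to be installed.

```r
# Hypothetical sample sentence for illustration only.
sentence <- "Counting words in a string is a common task"

# Base R: split on runs of whitespace and count the pieces.
words <- strsplit(trimws(sentence), "\\s+")[[1]]
n_base <- length(words)
print(n_base)

# stringr: count runs of non-whitespace characters (skipped if not installed).
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_count(sentence, "\\S+"))
}

# stringi: word counting based on text boundary rules (skipped if not installed).
if (requireNamespace("stringi", quietly = TRUE)) {
  print(stringi::stri_count_words(sentence))
}
```

The approaches can disagree on edge cases such as punctuation or hyphenated words, so it is worth checking them against your own data.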


Making Code Developer Friendly with an Example in R

Mark Niemann-Ross says the rest is commentary:

If you are reading this, you’re a coder and use functions. We write them for ourselves. If someone else writes a function, you can hope it works. If it doesn’t, you can hope to fix it. Hopefully, the return value is obviously correct. But maybe it’s subtly wrong?

If things are amiss, read the name of the function and hope it’s descriptive. I worked with a programmer who omitted all vowels from his function names. So the above code would expand to this…

Read on for the rationale behind commenting your functions appropriately, as well as one way to do it in R. There is a bit of art and a bit of science to writing good comments, but the starting point is simply having them to begin with. And the more clever you feel like you’re being, the more you need to comment this, because three months from now, you probably won’t be feeling quite as clever. H/T R-Bloggers.
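One common convention for this in R is roxygen2-style comments. I am not claiming this is the approach Mark takes in the post; the function below is a hypothetical example of mine, with vowels intact.

```r
#' Convert a temperature from Celsius to Fahrenheit.
#'
#' @param celsius Numeric vector of temperatures in degrees Celsius.
#' @return Numeric vector of the same length, in degrees Fahrenheit.
#' @examples
#' celsius_to_fahrenheit(100)
celsius_to_fahrenheit <- function(celsius) {
  # Fail early with a clear message rather than returning something subtly wrong.
  stopifnot(is.numeric(celsius))
  celsius * 9 / 5 + 32
}

print(celsius_to_fahrenheit(c(0, 100)))
```

The header says what goes in and what comes out, and the inline comment explains why the check exists rather than restating the code.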


Selecting Columns Containing a Specific String in R

Steven Sanderson goes hunting for strings:

Today I want to discuss a common task in data manipulation: selecting columns containing a specific string. Whether you’re working with base R or popular packages like stringr, stringi, or dplyr, I’ll show you how to efficiently achieve this. We’ll cover various methods and provide clear examples to help you understand each approach. Let’s get started!

Click through for five examples across the three methods.
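To give a flavor of the idea, here is a small sketch of my own using base R, with a dplyr variant that only runs if dplyr is installed. The data frame and column names are invented for the example.

```r
# Hypothetical data frame for illustration only.
df <- data.frame(
  sales_q1 = c(10, 20, 30),
  sales_q2 = c(15, 25, 35),
  region   = c("North", "South", "East")
)

# Base R: flag the column names containing the literal string "sales".
has_sales <- grepl("sales", names(df), fixed = TRUE)

# drop = FALSE keeps the result a data frame even if only one column matches.
sales_df <- df[, has_sales, drop = FALSE]
print(names(sales_df))

# dplyr: the contains() selection helper does the same job.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(names(dplyr::select(df, dplyr::contains("sales"))))
}
```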


Checking if a Column Exists in an R Data Frame

Steven Sanderson takes a peek:

When working with data frames in R, it’s common to need to check whether a specific column exists. This is particularly useful in data cleaning and preprocessing, to ensure your scripts don’t throw errors if a column is missing. Today, we’ll explore several methods to perform this check efficiently in R, and I encourage you to try these methods out with your own data sets.

Read on for four ways to do this.
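A minimal base R sketch of the check, using a made-up data frame, could look like this. It is one plausible approach and not necessarily one of the four in the post.

```r
# Hypothetical data frame for illustration only.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# %in% against the column names gives a simple TRUE/FALSE answer.
has_name <- "name" %in% names(df)
has_age  <- "age" %in% names(df)
print(c(name = has_name, age = has_age))

# When a script needs several columns, setdiff() lists the ones that are missing.
required <- c("id", "name", "age")
missing_cols <- setdiff(required, names(df))
print(missing_cols)
```

Checking up front like this lets a script stop with a useful message instead of failing halfway through.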


Checking if a Column Contains a String in R

Steven Sanderson performs a check:

Whether you’re doing some data cleaning or exploring your dataset, checking if a column contains a specific string can be a crucial task. Today, I’ll show you how to do this using both str_detect() from the stringr package and base R methods. We’ll also tackle finding partial strings and counting occurrences. Let’s dive right in!

Read on for a few variants on the theme.
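As a rough sketch of the theme (sample data is my own), base R's grepl() handles detection, partial matches, and counting. The str_detect() call only runs if stringr is installed.

```r
# Hypothetical data frame for illustration only.
df <- data.frame(fruit = c("apple", "banana", "pineapple", "cherry"))

# grepl() returns one TRUE/FALSE per row; fixed = TRUE treats the pattern literally.
has_apple <- grepl("apple", df$fruit, fixed = TRUE)

# Does the column contain the string anywhere, and in how many rows?
any_apple <- any(has_apple)
n_apple   <- sum(has_apple)
print(any_apple)
print(n_apple)

# stringr equivalent of the detection step.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(df$fruit, stringr::fixed("apple")))
}
```

Note that "pineapple" counts as a match here because the check is for a partial string, not a whole-value comparison.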


Collapsing or Concatenating Text in R

Steven Sanderson builds a list:

When working with data frames in R, you may often encounter scenarios where you need to collapse or concatenate text values based on groups within your dataset. This could involve combining text from multiple rows into a single row per group, which can be useful for summarizing data or preparing it for further analysis. In this post, we’ll explore how to achieve this task using different methods in R—specifically using base R, the dplyr package, and the data.table package.

This is the R equivalent of T-SQL’s STRING_AGG() function, or the STUFF() + FOR XML PATH approach if you’re still on an older version of SQL Server.
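For the SQL Server folks who want to see the shape of it, here is a base R sketch of my own using aggregate() and paste(). The data frame and column names are assumptions for the example.

```r
# Hypothetical data frame for illustration only.
df <- data.frame(
  group = c("a", "a", "b"),
  text  = c("x", "y", "z")
)

# aggregate() applies paste(..., collapse = ", ") to the text values in each group,
# giving one row per group, much like STRING_AGG() with GROUP BY.
collapsed <- aggregate(text ~ group, data = df, FUN = paste, collapse = ", ")
print(collapsed)
```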


Counting NA Values in R

Steven Sanderson counts what doesn’t exist:

Welcome back, R enthusiasts! Today, we’re going to explore a fundamental task in data analysis: counting the number of missing (NA) values in each column of a dataset. This might seem straightforward, but there are different ways to achieve this using different packages and methods in R.

Let’s dive right in and compare how to accomplish this task using base R, dplyr, and data.table. Each method has its own strengths and can cater to different preferences and data handling scenarios.

Read on for 3 1/2 separate methods.
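The base R version is short enough to sketch here. The data frame is made up for the example, and this may not match the exact code in the post.

```r
# Hypothetical data frame for illustration only.
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = 1:3
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(df))
print(na_counts)
```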


Model Selection with AIC

Steven Sanderson talks about the Akaike Information Criterion:

In the world of data analysis and statistics, one of the key challenges is selecting the best model to describe and analyze your data. This decision is crucial because it impacts the accuracy and reliability of your results. Among the many tools available, the Akaike Information Criterion (AIC) stands out as a powerful method for comparing different models and choosing the most suitable one.

Today we will go through an example of model selection using the AIC, specifically focusing on its application to various statistical distributions available in the TidyDensity package. TidyDensity, a part of the healthyverse ecosystem, offers a comprehensive suite of tools for data analysis in R, including functions to compute AIC scores for different probability distributions.

Read on for a quick primer on the AIC itself and how you can use it in TidyDensity.
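To make the idea concrete without guessing at TidyDensity's function signatures, here is a base R sketch of my own. It fits an exponential and a normal distribution to simulated data by maximum likelihood and compares them with AIC = 2k - 2 log L, where k is the number of estimated parameters. The simulated data and variable names are assumptions for the example.

```r
set.seed(42)

# Simulated data that really is exponential (an assumption for this demo).
x <- rexp(200, rate = 2)

# Exponential fit: the MLE of the rate is 1 / mean, so k = 1 parameter.
ll_exp  <- sum(dexp(x, rate = 1 / mean(x), log = TRUE))
aic_exp <- 2 * 1 - 2 * ll_exp

# Normal fit: MLEs are the sample mean and the n-denominator sd, so k = 2.
mu      <- mean(x)
sigma   <- sqrt(mean((x - mu)^2))
ll_norm  <- sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
aic_norm <- 2 * 2 - 2 * ll_norm

# Lower AIC indicates the preferred model among those compared.
print(c(exponential = aic_exp, normal = aic_norm))
```

AIC only ranks the candidates you give it; a low score does not mean the best of a bad bunch actually fits well.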


MCMC Sampling with TidyDensity

Steven Sanderson performs some sampling:

In the area of statistical modeling and Bayesian inference, Markov Chain Monte Carlo (MCMC) methods are indispensable tools for tackling complex problems. The new tidy_mcmc_sampling() function in the TidyDensity R package simplifies MCMC sampling and visualization, making it accessible to a broader audience of data enthusiasts and analysts.

Read on for a brief primer on MCMC and an example of how the tidy_mcmc_sampling() function works.
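For a feel of what MCMC does in general, here is a bare-bones random-walk Metropolis sampler in base R. This is my own generic sketch targeting a standard normal distribution; it is not how tidy_mcmc_sampling() is implemented and does not use TidyDensity at all.

```r
set.seed(123)

n_iter  <- 10000
draws   <- numeric(n_iter)
current <- 0  # arbitrary starting value

for (i in seq_len(n_iter)) {
  # Propose a move by adding symmetric random noise to the current state.
  proposal <- current + rnorm(1, sd = 1)

  # Log of the ratio of target densities (standard normal target).
  log_ratio <- dnorm(proposal, log = TRUE) - dnorm(current, log = TRUE)

  # Accept with probability min(1, ratio); otherwise stay put.
  if (log(runif(1)) < log_ratio) {
    current <- proposal
  }
  draws[i] <- current
}

# Discard the first 1000 draws as burn-in.
kept <- draws[-seq_len(1000)]
print(c(mean = mean(kept), sd = sd(kept)))
```

With enough iterations, the retained draws should have a mean near 0 and a standard deviation near 1, matching the target.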
