R – Page 16 – Curated SQL

Collapsing or Concatenating Text in R

Published 2024-05-10 by Kevin Feasel

When working with data frames in R, you may often encounter scenarios where you need to collapse or concatenate text values based on groups within your dataset. This could involve combining text from multiple rows into a single row per group, which can be useful for summarizing data or preparing it for further analysis. In this post, we’ll explore how to achieve this task using different methods in R—specifically using base R, the dplyr package, and the data.table package.

This is the R equivalent of T-SQL’s STRING_AGG() function, or the STUFF() + FOR XML PATH approach if you’re still on an older version of SQL Server.

Comments closed

Counting NA Values in R

Published 2024-05-09 by Kevin Feasel

Steven Sanderson counts what doesn’t exist:

Welcome back, R enthusiasts! Today, we’re going to explore a fundamental task in data analysis: counting the number of missing (NA) values in each column of a dataset. This might seem straightforward, but there are different ways to achieve this using different packages and methods in R.

Let’s dive right in and compare how to accomplish this task using base R, dplyr, and data.table. Each method has its own strengths and can cater to different preferences and data handling scenarios.

Read on for 3 1/2 separate methods.

Comments closed

Model Selection with AIC

Published 2024-05-07 by Kevin Feasel

Steven Sanderson talks about the Akaike Information Criterion:

In the world of data analysis and statistics, one of the key challenges is selecting the best model to describe and analyze your data. This decision is crucial because it impacts the accuracy and reliability of your results. Among the many tools available, the Akaike Information Criterion (AIC) stands out as a powerful method for comparing different models and choosing the most suitable one.

Today we will go through an example of model selection using the AIC, specifically focusing on its application to various statistical distributions available in the TidyDensity package. TidyDensity, a part of the healthyverse ecosystem, offers a comprehensive suite of tools for data analysis in R, including functions to compute AIC scores for different probability distributions.

Read on for a quick primer on the AIC itself and how you can use it in TidyDensity.

Comments closed

MCMC Sampling with TidyDensity

Published 2024-05-06 by Kevin Feasel

Steven Sanderson performs some sampling:

In the area of statistical modeling and Bayesian inference, Markov Chain Monte Carlo (MCMC) methods are indispensable tools for tackling complex problems. The new tidy_mcmc_sampling() function in the TidyDensity R package simplifies MCMC sampling and visualization, making it accessible to a broader audience of data enthusiasts and analysts.

Read on for a brief primer on MCMC and an example of how the tidy_mcmc_sampling() function works.

Comments closed

Checking for Duplicate Rows with TidyDensity

Published 2024-05-03 by Kevin Feasel

Steven Sanderson looks for dupes:

Today, we’re diving into a useful new function from the TidyDensity R package: check_duplicate_rows(). This function is designed to efficiently identify duplicate rows within a data frame, providing a logical vector that flags each row as either a duplicate or unique. Let’s explore how this function works and see it in action with some illustrative examples.

Read on to see how it works. Though I am curious about whether there’s an option to ignore certain columns, such as row IDs or other “non-essential” columns you don’t want to include for comparison. Also, checking how it handles NA or NULL would be interesting.

Comments closed

Quantile Normalization with TidyDensity

Published 2024-05-02 by Kevin Feasel

Steven Sanderson achieves normality:

In data analysis, especially when dealing with multiple samples or distributions, ensuring comparability and removing biases is crucial. One powerful technique for achieving this is quantile normalization. This method aligns the distributions of values across different samples, making them more similar in terms of their statistical properties.

Read on to see how you can use the TidyDensity package to pull this off.

Comments closed

Using strsplit() with Multiple Delimiters in R

Published 2024-04-30 by Kevin Feasel

Steven Sanderson shows off some more complex string splitting scenarios in R:

In data preprocessing and text manipulation tasks, the strsplit() function in R is incredibly useful for splitting strings based on specific delimiters. However, what if you need to split a string using multiple delimiters? This is where strsplit() can really shine by allowing you to specify a regular expression that defines these delimiters. In this blog post, we’ll dive into how you can use strsplit() effectively with multiple delimiters to parse strings in your data.

Read on for two examples of complex scenarios.

Comments closed

Evenly Spacing Month Charts in ggplot2

Published 2024-04-29 by Kevin Feasel

Jameson Marriott fixes a spacing issue:

I recently noticed that ggplot2 spaces date axes literally even when grouped by month. I’ve been using ggplot2 extensively for years and I don’t remember noticing before, so this is not really a big deal, but now that I know it bugs me a lot. Take a look below.

I don’t think I had noticed this before either, though now that Jameson has pointed it out, it certainly is annoying. H/T R-Bloggers.

Comments closed

Dropping Data Frame Columns in R

Published 2024-04-26 by Kevin Feasel

Steven Sanderson performs feature selection:

As an R programmer, one of the fundamental tasks you’ll encounter is manipulating data frames. Whether you’re cleaning messy data or preparing it for analysis, knowing how to drop unnecessary columns is a valuable skill. In this guide, we’ll walk through the process of dropping columns from data frames in R, using simple examples to demystify the process.

Read on for three ways of doing this.

Comments closed

Selecting the Top N Values by Group in R

Published 2024-04-25 by Kevin Feasel

Steven Sanderson searches for subsets:

In data analysis, there often arises a need to extract the top N values within each group of a dataset. Whether you’re dealing with sales data, survey responses, or any other type of grouped data, identifying the top performers or outliers within each group can provide valuable insights. In this tutorial, we’ll explore how to accomplish this task using three popular R packages: dplyr, data.table, and base R. By the end of this guide, you’ll have a solid understanding of various approaches to selecting top N values by group in R.

Read on for the three examples.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Category: R