R – Page 16 – Curated SQL

Regular Expressions in R

Published 2024-04-15 by Kevin Feasel

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1, x2, x3, x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Click through for a worked-out example.

Comments closed

Selecting by Index in R

Published 2024-04-12 by Kevin Feasel

Steven Sanderson grabs some rows:

Imagine you want to analyze the fuel efficiency (miles per gallon) of a particular car. Here’s how to grab a single row by its index (row number):

Read on for that example, as well as examples covering two other cases: multiple rows based on index value, and ranges of indices.

Comments closed

The Performance of Various Tidy Wrappers

Published 2024-04-11 by Kevin Feasel

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.table, arrow, polars and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example is a dialect of SQL, an entirely different language so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.

Comments closed

Removing Multiple Rows from a DataFrame via Base R

Published 2024-04-11 by Kevin Feasel

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.

Comments closed

Removing Rows with Missing Data in R

Published 2024-04-10 by Kevin Feasel

Steven Sanderson shows us three ways:

Handling missing values is a crucial aspect of data preprocessing in R. Often, datasets contain missing values, which can adversely affect the analysis or modeling process. One common task is to remove rows containing missing values entirely. In this tutorial, we’ll explore different methods to accomplish this task in R, catering to scenarios where we want to remove rows with either some or all missing values.

Click through for three ways to do this.

Comments closed

Replicating a Product Chart with ggplot2

Published 2024-04-09 by Kevin Feasel

Mauricio Vargas Sepúlveda makes some tea:

Tetley tea boxes feature the following caffeine meter:

In R we can replicate this meter using ggplot2.

I enjoy this kind of thing because, even though the end result doesn’t look exactly like the chart on the tea box, it’s close enough to appreciate the effort. H/T R-Bloggers.

Comments closed

Multi-Column Joins in R

Published 2024-04-08 by Kevin Feasel

Steven Sanderson joins on multiple columns:

Let’s start with a simple scenario. You have two data frames, and you want to merge them based on two columns: ID and Year. The goal is to combine the data where the ID and Year values match in both data frames.

All of the examples here use the merge() function, so check them out.

Comments closed

Multidimensional Scaling in R

Published 2024-04-05 by Kevin Feasel

Steven Sanderson is from the 5th dimension:

Visualizing similarities between data points can be tricky, especially when dealing with many features. This is where multidimensional scaling (MDS) comes in handy. It allows us to explore these relationships in a lower-dimensional space, typically 2D or 3D for easier interpretation. In R, the cmdscale() function from base R and is a great tool for performing classical MDS.

Click through to see how this works. In case you’re curious, cmdscale() is an example of principal coordinates analysis. If you’re familiar with principal components analysis, that’s a different form of multidimensional scaling.

Comments closed

Tips for Dealing with Large Spatial Datasets

Published 2024-04-05 by Kevin Feasel

Rhian Davies consults the map:

I love playing with spatial data. Perhaps because I enjoy exploring the outdoors, or because I spend hours playing Geoguessr, or maybe it’s just because maps are pretty but there’s nothing more fun than tinkering with location data.

However, reading in spatial data, especially large data sets can sometimes be a pain. Here are some simple things to consider when working in spatial data in R and breaking large data sets into more manageable chunks.

Click through for three tips when dealing with spatial data. The code is in R but the tips make sense in any language.

Comments closed

Normalizing Data in R

Published 2024-04-03 by Kevin Feasel

Steven Sanderson says, act normal:

Data normalization is a crucial preprocessing step in data analysis and machine learning workflows. It helps in standardizing the scale of numeric features, ensuring fair treatment to all variables regardless of their magnitude. In this tutorial, we’ll explore how to normalize data in R using practical examples and step-by-step explanations.

Read on for a definition of what this means and how you can do it.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: R