Press "Enter" to skip to content

Category: R

Non-Equi Joins in R

David Selby walks us through non-trivial join scenarios in R:

Most joins are equi-joins, matching rows according to two columns having exactly equal values. These are easy to perfom in R using the base merge() function, the various join() functions in dplyr and the X[i] syntax of data.table.

But sometimes we need non-equi joins or θ-joins, where the matching condition is an interval or a set of inequalities. Other situations call for a rolling join, used to link records according to their proximity in a time sequence.

How do you perform non-equi joins and rolling joins in R?

Click through for the answer using dplyr, sqldf, and data.table. H/T R-bloggers

Comments closed

Polychoric Correlation in Practice

Jack Davis explains the concept of polychoric correlation:

In polychoric correlation, we don’t need to know or specify where the boundary between “good” and “very good” is, just that it exists. The distribution of the ordinal responses, along with the assumption that the latent values follow a normal distribution, is enough that the polychor() function in the polycor R package can do that for us. In most practical cases, you don’t even need to know where the cutoffs are, but they are useful for demonstration that the method works.

Polychoric correlation estimates the correlation between such latent variables as if you actually knew what those values were. In the examples given, we start with the latent variables and use cutoffs to set them into bins, and then use polychoric on the artificially binned data. In any practical use case, the latent data would be invisible to you, and the cutoffs would be determined by whoever designed the survey.

Read on for a demonstration of the process in R.

Comments closed

Using the Pipe Operator in ggplot2

Tomaz Kastrun reduces the number of pipe-like operators:

Using pipe %>% or chaining commands is very intuitive for creating following set of commands for data preparation. Visualization library ggplot in this manner uses sign “+” (plus) to do all the chaining. What if we would have to replace with the pipe sign?

This is because ggplot was developed prior to magrittr took over the piping world in R, so there wasn’t a “normal” pipe. I had been hopeful that ggvis would take over, as it does use the %>% pipe, but that project has gone dormant.

Comments closed

Image Sizing in RMarkdown Documents

The Jumping Rivers team shares some insight on image creation:

In this series of posts we’ll consider the (simple?) task of generating and including figures for the web using R & {knitr}. Originally this was going to be a single post, but as the length increase, we’ve decided to separate it into a separate articles. The four posts we intend to cover are

– setting the image size (this post)
– selecting the image type, PNG vs JPEG vs SVG
– including non-generated files in a document
– setting global {knitr} options.

Read on for the first post in the series.

Comments closed

Plotting Multiple Plots in R using map and ggplot

Sebastian Sauer gives us a quick solution to plotting one graph per variable:

Say we have a data frame where we would like to plot each numeric variables’s distribution.

There are a number of good solutions outthere such as this one, or here, or here.

When I read this, my first thought was along the lines of, “Why not use facets or something like cowplot?” But then it clicked that this is per-variable plotting, whereas faceting requires you choose a variable and see the plots based on that variable’s distinct values..

Comments closed

Counting Open Lockers in R

Holger von Jouanne-Diedrich solves a riddle:

We are standing in front of 100 lockers arranged side by side, all of which are closed. One man has a bunch of keys with all 100 keys and will pass the lockers exactly a hundred times, opening or closing some of them.

On the first pass, he opens all the lockers. On the second pass, the man will go to every other locker and change its state. That means: If it is closed, it will be opened. If it is already open, it will be closed. In this case, he closes lockers 2, 4, 6… 98 and 100, because all doors were open before.

On the third pass, he changes the state of every third locker – that is, 3, 6, 9, … 96, 99. Closed doors are opened, open doors closed. In the fourth pass, every fourth locker is changed, at the fifth every fifth – and so on. At the last, the 100th, the man finally only changes the state of door number 100.

The question is: How many of the 100 compartments are open after the 100th pass?

Click through for one solution in R.

Comments closed

Sunflower Plots in R

Kenneth Tay takes a look at a sunflower plot:

sunflower plot is a type of scatterplot which tries to reduce overplotting. When there are multiple points that have the same (x, y) values, sunflower plots plot just one point there, but has little edges (or “petals”) coming out from the point to indicate how many points are really there.

My first thought on it is that it’s too busy and doesn’t do its job of portraying a mass of data points very well. When you have just a few observations, then yeah, it’s not too bad. But once you have any reasonable amount of density on the plot, it’s better to use jitter and transparency (as Kenneth points out). H/T R-bloggers

Comments closed

Reporting on Correlation Analysis in R

Petr Baranovskiy continues a series on correlation analysis using R:

This is the second part of the Correlation Analysis in R series. In this post, I will provide an overview of some of the packages and functions used to perform correlation analysis in R, and will then address reporting and visualizing correlations as text, tables, and correlation matrices in online and print publications.

Read the whole thing.

Comments closed

Model Post-Processing with insight

The easystats team talks about the insight package in R:

We are talking about the insight package. It is what allows other packages, like easystats (parameterseffectsizeperformancereport, …) or ggstatsplotsjstats or modelsummary to be as powerful as they are, supporting tons of different R models. So why make you life hard when you can be like them, and rely on insight?

It is made for developers (and users) that do some postprocessing of different models (e.g., extracting stuff like parameters, values, data, names, specifications, predictions, priors, etc.), whether it is to nicely display their results or to do further computation.

Click through for an example of what it does and how it works. H/T R-bloggers

Comments closed

Determining a Good Test Set Size

John Mount thinks about test set size:

In this note we will answer “what is a good test set size?” three ways.

– The usual practical answer.
– A decision theory answer.
– A novel variational answer.

Each of these answers is a bit different, as they are solved in slightly different assumed contexts and optimizing different objectives. Knowing all 3 solutions gives us some perspective on the problem.

My rule of thumb is that I want it to be as small as possible while containing the highest likelihood of hitting all real-world scenarios enough times to provide a valid comparison. This conversely maximizes the size of the training data set, giving us the best chance of seeing the widest variety of scenarios we can during the formative phase.

And as usual, John goes way deeper than my rules of thumb. I like this post a lot.

Comments closed