Press "Enter" to skip to content

Category: R

Creating Your Own ggplot2 Geom

Isabella Velasquez is feeling creative:

If you use ggplot2, you are probably used to creating plots with geom_line() and geom_point(). You may also have ventured into the broader ggplot2 ecosystem to use geoms like geom_density_ridges() from ggridges or geom_signif() from ggsignif. But have you ever wondered how these extensions were created? Where did the authors figure out how to create a new geom? And, if the plot of your dreams doesn’t exist, how would you make your own?

Enter the exciting world of creating your own ggplot2 extensions.

The post looks a lot like a series of slides, and it takes you through the process of creating a new geom. H/T R-Bloggers.
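
If you want a feel for the pattern before reading the post, here is a minimal sketch of the general extension mechanism (a ggproto object plus a layer() wrapper), not Isabella's example; the geom_simple_point() name is just for illustration.

library(ggplot2)
library(grid)

# Bare-bones point geom: the ggproto object defines how to draw the data
GeomSimplePoint <- ggproto("GeomSimplePoint", Geom,
  required_aes = c("x", "y"),
  default_aes = aes(shape = 19, colour = "black", size = 1.5, alpha = NA),
  draw_key = draw_key_point,

  draw_panel = function(data, panel_params, coord) {
    # Transform the data to panel coordinates, then return a grid grob
    coords <- coord$transform(data, panel_params)
    grid::pointsGrob(
      coords$x, coords$y,
      pch = coords$shape,
      gp = grid::gpar(col = coords$colour)
    )
  }
)

# The user-facing constructor simply builds a layer around the ggproto object
geom_simple_point <- function(mapping = NULL, data = NULL, stat = "identity",
                              position = "identity", na.rm = FALSE,
                              show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    geom = GeomSimplePoint, mapping = mapping, data = data, stat = stat,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

# Usage: behaves like a stripped-down geom_point()
ggplot(mtcars, aes(wt, mpg)) + geom_simple_point()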


Using Dygraphs in R

Thomas Williams builds a chart:

I also wanted to get a little interactive with my analysis, and came across Dygraphs for R (https://rstudio.github.io/dygraphs/), which wraps the “venerable” (according to creator Dan Vanderkam, https://github.com/danvk) JavaScript charting library of the same name, first released in 2006.

I used Dygraphs in an R script file (it can work equally well in R Markdown) to quickly chart my time series data, loaded from the CSV file. Dygraphs was simple to use, is a solid pick among other charting libraries, and is very functional for being free and open source.

Read on for a few examples of charts, as well as the entirety of Thomas’s code.
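
As a rough idea of what that looks like, here is my own sketch rather than Thomas's code; it assumes a file data.csv with date and value columns.

library(dygraphs)
library(xts)

# Load the CSV and convert it to an xts time series, which dygraph() expects
df <- read.csv("data.csv")
series <- xts::xts(df$value, order.by = as.Date(df$date))

# Interactive chart with a range selector for zooming and panning
dygraph(series, main = "Daily values") %>%
  dyRangeSelector() %>%
  dyOptions(drawPoints = TRUE, pointSize = 2)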


Generalized Additive Models for Customer Lifetime Value Estimation

Nicholas Clark builds a GAM:

I typically work in quantitative ecology and molecular epidemiology, where we use statistical models to predict species distributions or disease transmission patterns. Recently though, I had an interesting conversation with a data science PhD student who mentioned they were applying GAMs to predict Customer Lifetime Value at a SaaS startup. This caught my attention because CLV prediction, as it turns out, faces remarkably similar statistical challenges to ecological forecasting: nonlinear relationships that saturate at biological or business limits, hierarchical structures where groups behave differently, and the need to balance model flexibility with interpretability for stakeholders who need to understand why the model makes certain predictions.

This is an interesting article and I had not thought of using a GAM for calculating Customer Lifetime Value. I used a much simpler technique the one time I calculated CLV in earnest. H/T R-Bloggers.
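
For a sense of what a GAM for this kind of problem might look like in mgcv, here is a sketch on simulated data (not Nicholas's model); the variables tenure, usage, and segment are made up for illustration.

library(mgcv)

# Simulate a toy customer dataset with a saturating tenure effect
set.seed(42)
customers <- data.frame(
  tenure  = runif(500, 0, 36),                    # months as a customer
  usage   = rgamma(500, shape = 2, scale = 10),   # monthly activity
  segment = factor(sample(c("SMB", "Mid", "Ent"), 500, replace = TRUE))
)
customers$clv <- with(customers,
  200 * (1 - exp(-tenure / 12)) + 3 * sqrt(usage) +
  as.numeric(segment) * 25 + rnorm(500, sd = 20))

# Smooth terms for the nonlinear effects, random-effect smooth for segment
fit <- gam(
  clv ~ s(tenure) + s(usage) + s(segment, bs = "re"),
  data = customers, method = "REML"
)
summary(fit)
plot(fit, pages = 1)   # inspect the fitted smooths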


Inferential Statistics in Excel using R

Adam Gladstone does a bit of inference testing:

In the previous posts in this series (Using R in Excel) I have demonstrated some basic use-cases where using R in Excel is useful. Specifically, we have looked at descriptive statistics, linear regression, forecasting, and calling Python. In this post, I am going to look at inferential statistics and how R can be used (in Excel) to perform some typical statistical tests. Excel provides many excellent facilities for data wrangling and analysis. However, for certain types of statistical data analysis the built-in functions and the Analysis ToolPak are not sufficient, and R provides superior facilities.

Read on for a few examples of tests, though there are a huge number available in R itself as well as its ecosystem of packages.
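
The R side of this is plain base R; the Excel plumbing is specific to Adam's add-in and not shown here. A few of the usual suspects, run on simulated values:

# Two simulated samples standing in for real spreadsheet data
set.seed(1)
x <- rnorm(30, mean = 5.2, sd = 1.1)
y <- rnorm(30, mean = 4.8, sd = 1.0)

t.test(x, y)        # two-sample t-test comparing the means
wilcox.test(x, y)   # non-parametric alternative to the t-test
var.test(x, y)      # F-test comparing the two variances
shapiro.test(x)     # normality check on one sample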


Simulating the Monty Hall Problem in R

Jason Bryer takes us through a classic introductory problem to Bayesian statistics:

I find that when teaching statistics (and probability) it is often helpful to simulate data first in order to get an understanding of the problem. The Monty Hall problem recently came up in a class so I implemented a function to play the game.

The Monty Hall problem results from a game show, Let’s Make a Deal, hosted by Monty Hall. In this game, the player picks one of three doors. Behind one is a car; behind the other two are goats. After picking a door, the player is shown the contents of one of the other two doors, which, because the host knows the contents, is always a goat. The question to the player: Do you switch your choice?

This is one of the biggest “aha!” moments in statistics, in the sense that it is not intuitively obvious and is easy to get wrong, but once you understand why it is true, it makes reasoning about how new knowledge should change your probabilities much easier. H/T R-Bloggers.
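
If you want to convince yourself of the two-thirds result before reading Jason's implementation, a quick simulation of your own takes only a few lines (this sketch is mine, not his):

# Play one game of Monty Hall; return TRUE if the player wins the car
monty_hall <- function(switch_door) {
  doors <- 1:3
  car   <- sample(doors, 1)
  pick  <- sample(doors, 1)
  # Host opens a goat door that is neither the car nor the player's pick
  opened <- if (car == pick) sample(setdiff(doors, pick), 1) else setdiff(doors, c(car, pick))
  final  <- if (switch_door) setdiff(doors, c(pick, opened)) else pick
  final == car
}

set.seed(123)
mean(replicate(10000, monty_hall(switch_door = TRUE)))   # approximately 2/3
mean(replicate(10000, monty_hall(switch_door = FALSE)))  # approximately 1/3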


Testing in R with testthat

Aida Gjoka writes a test:

Testing is an important step when developing code in R or any other language. If you are a Python user, you can consider reading our previous blogs on pytest. Writing tests helps us make sure that the code is working as expected. In the R ecosystem, the testthat package is one of the most used frameworks. In this blog we will explore some of the main properties of {testthat}, highlighting some of its most useful functions with examples.

Read on to see how it works. This isn’t a mocking library, but rather an assertions-based testing library. And near the end, Aida includes an extra library that helps with plot testing.
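
The basic shape of a testthat test looks something like this; the rescale01() function is made up for illustration, and the expectations are the kind of assertions the package provides.

library(testthat)

# Function under test: rescale a numeric vector onto [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

test_that("rescale01 maps values onto [0, 1]", {
  out <- rescale01(c(2, 4, 6, 8, 10))
  expect_equal(min(out), 0)
  expect_equal(max(out), 1)
  expect_length(out, 5)
  expect_error(rescale01("a"))   # non-numeric input should fail
})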


Handling Missing Data in R

M. Fatih Tüzen fills in the gaps:

Data preprocessing is a cornerstone of any data analysis or machine learning pipeline. Raw data rarely comes in a form ready for direct analysis — it often requires cleaning, transformation, normalization, and careful handling of anomalies. Among these preprocessing tasks, dealing with missing data stands out as one of the most critical and unavoidable challenges.

Missing values appear in virtually every domain: surveys may have skipped questions, administrative registers might contain incomplete records, and clinical trials can suffer from patient dropout. Ignoring these gaps or handling them naively does not just reduce the amount of usable information; it can also introduce bias, decrease statistical power, and ultimately compromise the validity of conclusions. In other words, missing data is not just an inconvenience — it is a methodological problem that demands rigorous attention.

Quite often, we gloss over what to do with missing data when explaining or working through the data science process, in part because it’s a hard problem. This post digs into the specifics of the matter, taking us through eight separate methods. H/T R-Bloggers.
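
As a tiny illustration of how different the approaches can be, here is a sketch on toy data contrasting naive mean imputation with multiple imputation via the mice package; mice is one common route in R and may or may not be among the eight methods the post covers.

library(mice)

# Toy dataset with scattered missing values
df <- data.frame(
  age    = c(25, 31, NA, 46, 52, NA, 38, 29, 61, 44),
  income = c(42, NA, 51, 60, NA, 47, 55, 39, NA, 58)
)

# 1. Naive mean imputation: easy, but shrinks variance and ignores uncertainty
df_mean <- df
df_mean$age[is.na(df_mean$age)]       <- mean(df$age, na.rm = TRUE)
df_mean$income[is.na(df_mean$income)] <- mean(df$income, na.rm = TRUE)

# 2. Multiple imputation: several completed datasets, analyses pooled afterwards
imp <- mice(df, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
fit <- with(imp, lm(income ~ age))
pool(fit)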
