Press "Enter" to skip to content

Category: R

Bootstrapping in TidyDensity

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.
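If you want to see the core idea outside of the package, here is a quick base R sketch of bootstrapping the mean MPG from mtcars; it is illustrative only, while the post itself works through TidyDensity's bootstrap_stat_plot().

# Base R sketch of the bootstrap idea: resample mtcars$mpg with replacement
# many times and look at the distribution of the resampled means.
set.seed(123)
mpg <- mtcars$mpg

boot_means <- replicate(
  2000,
  mean(sample(mpg, size = length(mpg), replace = TRUE))
)

mean(boot_means)                      # bootstrap estimate of the mean
quantile(boot_means, c(0.025, 0.975)) # rough 95% percentile interval
hist(boot_means, main = "Bootstrap distribution of mean MPG")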

Data Reading and Writing with arrow

Colin Gillespie performs two of the three R’s:

Apache Arrow is a cross-language development platform for in-memory data. As it’s in-memory (as opposed to data stored on disk), it provides additional speed boosts. It’s designed for efficient analytic operations, and uses a standardised language-independent columnar memory format for flat and hierarchical data. The {arrow} R package provides an interface to the ‘Arrow C++’ library – an efficient package for analytic operations on modern hardware.

There are many great tutorials on using {arrow} (see the links at the bottom of the post for example). The purpose of this blog post isn’t to simply reproduce a few examples, but to understand some of what’s happening behind the scenes. In this particular post, we’re interested in understanding the reading/writing aspects of {arrow}.

Read on to see it in action in R.
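For a taste before you click through, here is a minimal sketch of round-tripping a data frame through Parquet with {arrow}, using the standard write_parquet() and read_parquet() calls (my example, not Colin's).

library(arrow)

# Write mtcars to a Parquet file and read it back.
tf <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tf)

# Reads into a data frame by default.
dat <- read_parquet(tf)
head(dat)

# as_data_frame = FALSE keeps the data as an Arrow Table instead of
# materialising it as an R data frame up front.
tab <- read_parquet(tf, as_data_frame = FALSE)
tab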

tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model generalizes to the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.
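As a rough sketch of how these pieces might fit together, here is what a call chain could look like. The argument names follow what the posts mention, but treat the exact signatures as assumptions rather than verified API.

library(tidyAML)
library(recipes)

# Assumed usage based on the posts: a recipe plus fast_regression()
# with the new .drop_na parameter, then the new residual extractor.
rec <- recipe(mpg ~ ., data = mtcars)

frt <- fast_regression(
  .data = mtcars,
  .rec_obj = rec,
  .parsnip_eng = c("lm", "glm"),
  .drop_na = TRUE   # new parameter discussed above (defaults to TRUE)
)

# New in this release (per the post): pull residuals from the fitted models.
resids <- extract_regression_residuals(frt)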

TidyDensity and data.table

Steven Sanderson makes use of data.table:

I’m thrilled to announce a major upgrade to the TidyDensity package that’s sure to accelerate your data analysis workflows. We’ve integrated the lightning-fast data.table package for generating tidy distribution data, resulting in a jaw-dropping 30% speed boost.

The data.table package is so much faster than its competition in so many cases, yet I really don’t like its syntax.
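For anyone who has not seen the syntax in question, here is a small illustrative comparison (mine, not from the post) of the same aggregation written in data.table and in dplyr.

library(data.table)
library(dplyr)

dt <- as.data.table(mtcars)

# data.table: dt[i, j, by] -- filter, compute, and group in one bracket call.
dt[mpg > 20, .(mean_mpg = mean(mpg), n = .N), by = cyl]

# The dplyr equivalent, for comparison.
mtcars |>
  filter(mpg > 20) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n())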

An Overview of Clustering Techniques in R

Peter Laurinec gives us an overview:

Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.

Read on for a quick description of what clustering is and a few use cases. Then, Peter dives into a variety of techniques and important things you should know about them. H/T R-Bloggers.
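If you want something small to run alongside the survey, here is a minimal base R k-means example (not from Peter's post).

# k-means on the numeric columns of iris, scaled so each variable
# contributes equally to the Euclidean distances.
x <- scale(iris[, 1:4])

set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)

# Compare the unsupervised clusters against the held-out species labels.
table(km$cluster, iris$Species)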

Benchmarking Cumulative Function Speed in TidyDensity

Steven Sanderson charts performance:

Statistical analysis often involves calculating various measures on large datasets. Speed and efficiency are crucial, especially when dealing with real-time analytics or massive data volumes. The TidyDensity package in R provides a set of fast cumulative functions for common statistical measures like mean, standard deviation, skewness, and kurtosis. But just how fast are these cumulative functions compared to doing the computations directly? In this post, I benchmark the cumulative functions against the base R implementations using the rbenchmark package.

Click through for the functions under test and how they fare.
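To get a feel for the style of comparison, here is a small rbenchmark sketch of my own, pitting a vectorised cumulative mean against a naive looped one; the post benchmarks TidyDensity's cumulative functions themselves.

library(rbenchmark)

x <- rnorm(1e4)

# Two base R ways to get a cumulative mean: a vectorised cumsum approach
# versus recomputing mean() at every position.
cum_mean_fast <- function(x) cumsum(x) / seq_along(x)
cum_mean_slow <- function(x) vapply(seq_along(x), function(i) mean(x[1:i]), numeric(1))

benchmark(
  vectorised = cum_mean_fast(x),
  looped     = cum_mean_slow(x),
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)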

The Triangular Distribution in TidyDensity

Steven Sanderson unleashes the power of the triangle:

Welcome back, fellow data enthusiasts! Today, we embark on an exciting journey into the world of statistical distributions with a special focus on the latest addition to the TidyDensity package – the triangular distribution. Tightly packed and versatile, this distribution brings a unique flavor to your data simulations and analyses. In this blog post, we’ll delve into the functions provided, understand their arguments, and explore the wonders of the triangular distribution.

Read on to learn what the triangular distribution is and how you can work with it in TidyDensity.
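As background on the distribution itself, here is a base R sketch of inverse-transform sampling from a triangular distribution; this is not TidyDensity's implementation, just the textbook inverse CDF.

# Inverse-transform sampler for a triangular distribution with
# minimum a, maximum b, and mode c (a <= c <= b).
rtri <- function(n, a = 0, b = 1, c = 0.5) {
  u <- runif(n)
  f <- (c - a) / (b - a)            # CDF value at the mode
  ifelse(u < f,
         a + sqrt(u * (b - a) * (c - a)),
         b - sqrt((1 - u) * (b - a) * (b - c)))
}

set.seed(1)
x <- rtri(10000, a = 0, b = 10, c = 3)
hist(x, breaks = 50, main = "Triangular(min = 0, max = 10, mode = 3)")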

TidyDensity 1.3.0 Released

Steven Sanderson has an update to the TidyDensity package:

The latest release of the TidyDensity R package brings some major changes and improvements that open up new possibilities for statistical analysis and data visualization. Version 1.3.0 includes breaking changes, new features, and a host of minor fixes and improvements that enhance performance and usability. Let’s dive into what’s new!

Read on for that change list and how you can get a copy of the TidyDensity R package.
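Getting a copy is the usual affair; the GitHub route below assumes the spsanderson/TidyDensity repository.

# CRAN release
install.packages("TidyDensity")

# Or the development version from GitHub
# install.packages("remotes")
remotes::install_github("spsanderson/TidyDensity")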

Aggregating by Month and Year in R

Steven Sanderson groups by month and year:

Taming the beast of daily data can be daunting. While it captures every detail, sometimes you need a bird’s-eye view. Enter aggregation, your secret weapon for transforming daily data into monthly and yearly insights. In this post, we’ll dive into the world of R, where you’ll wield powerful tools like dplyr and lubridate to master this data wrangling art.

Click through for examples of summarizing daily data into monthly and annual data. One thing to keep in mind, however, is that the monthly aggregation in these examples groups by month alone, so if you have July 2023 and July 2024 data, you’ll get back a single row for July covering both years. It’s all about understanding what the grain of your data is, as well as your desired grain.
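That grain point is easier to see in code. Here is a minimal dplyr/lubridate sketch (mine, not Steven's) contrasting month-only grouping with year-plus-month grouping.

library(dplyr)
library(lubridate)

dates <- seq(as.Date("2023-07-01"), as.Date("2024-07-31"), by = "day")
daily <- tibble(date = dates, sales = runif(length(dates), 100, 500))

# Month-only grain: July 2023 and July 2024 collapse into a single "Jul" row.
daily |>
  group_by(month = month(date, label = TRUE)) |>
  summarise(total_sales = sum(sales))

# Year + month grain: one row per calendar month.
daily |>
  group_by(year = year(date), month = month(date, label = TRUE)) |>
  summarise(total_sales = sum(sales), .groups = "drop")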

Explaining Models with Classic Methods and SHAP

Michael Mayer has some ‘splainin to do:

Let’s explain a {tidymodels} random forest by classic explainability methods (permutation importance, partial dependence plots (PDP), Friedman’s H statistics), and also fancy SHAP.

Disclaimer: {hstats}, {kernelshap} and {shapviz} are three of my own packages.

What I really appreciate here is that Michael includes the classic methods alongside SHAP. It can be easy to say “Oh, this is old and therefore no longer relevant.” But that would be quite wrong.
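As a reminder of just how simple the classic idea is, here is a bare-bones permutation importance sketch using randomForest; it is illustrative only, while Michael's post works with {hstats}, {kernelshap}, and {shapviz} on a {tidymodels} fit.

library(randomForest)

set.seed(1)
fit <- randomForest(mpg ~ ., data = mtcars)

rmse <- function(y, pred) sqrt(mean((y - pred)^2))
base_rmse <- rmse(mtcars$mpg, predict(fit, mtcars))

# Permutation importance: shuffle one feature at a time and measure
# how much the model's error increases relative to the baseline.
features <- setdiff(names(mtcars), "mpg")
perm_imp <- sapply(features, function(f) {
  perm <- mtcars
  perm[[f]] <- sample(perm[[f]])
  rmse(mtcars$mpg, predict(fit, perm)) - base_rmse
})

sort(perm_imp, decreasing = TRUE)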
