Press "Enter" to skip to content

Category: Data Science

Bootstrapping in TidyDensity

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.
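
If you want to see the core idea with nothing hidden, here is a minimal sketch of bootstrapping the mean MPG from mtcars in base R (this shows the general technique, not the TidyDensity interface, so check the post for bootstrap_stat_plot() itself):

# Bootstrap the mean MPG from mtcars: resample the data with replacement
# many times and compute the mean of each resample
set.seed(123)
boot_means <- replicate(2000, mean(sample(mtcars$mpg, replace = TRUE)))

# The spread of the bootstrap means shows how uncertain the sample mean is
mean(boot_means)
quantile(boot_means, c(0.025, 0.975))  # a simple percentile interval
hist(boot_means, main = "Bootstrap distribution of mean MPG")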

tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.
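
Roughly, the new parameter slots in like this (a sketch; .drop_na comes straight from the post, but the other argument names here are my assumptions about fast_regression()'s interface, so verify them against the tidyAML documentation):

library(tidyAML)
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)

# Setting .drop_na = FALSE keeps the rows for models that could not be
# built; the default TRUE drops them from the output
frt <- fast_regression(
  .data = mtcars,
  .rec_obj = rec,
  .parsnip_eng = c("lm", "glm"),
  .drop_na = FALSE
)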

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model generalizes to the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.

An Overview of Clustering Techniques in R

Peter Laurinec gives us an overview:

Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.

Read on for a quick description of what clustering is and a few use cases. Then, Peter dives into a variety of techniques and important things you should know about them. H/T R-Bloggers.
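
If you want a toy example to run alongside the post, a minimal k-means clustering of a synthetic two-group dataset in base R looks like this (the post covers many more methods and how to choose among them):

set.seed(42)
# Synthetic data: two Gaussian blobs in two dimensions
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2)
)

# k-means with k = 2; no true labels are needed (unsupervised)
km <- kmeans(x, centers = 2, nstart = 25)
table(km$cluster)
plot(x, col = km$cluster, pch = 19, main = "k-means with k = 2")
points(km$centers, col = 1:2, pch = 4, cex = 2, lwd = 3)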

The Triangular Distribution in TidyDensity

Steven Sanderson unleashes the power of the triangle:

Welcome back, fellow data enthusiasts! Today, we embark on an exciting journey into the world of statistical distributions with a special focus on the latest addition to the TidyDensity package – the triangular distribution. Tightly packed and versatile, this distribution brings a unique flavor to your data simulations and analyses. In this blog post, we’ll delve into the functions provided, understand their arguments, and explore the wonders of the triangular distribution.

Read on to learn what the triangular distribution is and how you can work with it in TidyDensity.
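
For a feel of the distribution itself, the triangular distribution on [a, b] with mode c can be sampled via its inverse CDF. Here is a base-R sketch (deliberately not the TidyDensity interface, whose function names and arguments the post walks through):

# Sample from a triangular distribution on [a, b] with mode c
# using the inverse CDF method
rtriangular <- function(n, a = 0, b = 1, c = 0.5) {
  u <- runif(n)
  f <- (c - a) / (b - a)                    # CDF value at the mode
  ifelse(
    u < f,
    a + sqrt(u * (b - a) * (c - a)),        # left branch of the triangle
    b - sqrt((1 - u) * (b - a) * (b - c))   # right branch of the triangle
  )
}

set.seed(123)
hist(rtriangular(10000, a = 0, b = 10, c = 3), breaks = 50,
     main = "Triangular(0, 10, mode = 3)")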

Explaining Models with Classic Methods and SHAP

Michael Mayer has some ‘splainin to do:

Let’s explain a {tidymodels} random forest by classic explainability methods (permutation importance, partial dependence plots (PDP), Friedman’s H statistics), and also fancy SHAP.

Disclaimer: {hstats}, {kernelshap} and {shapviz} are three of my own packages.

What I really appreciate here is that Michael includes the classic methods. It can be easy to say “Oh, this is old and therefore no longer relevant.” But that would be quite wrong.
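
As a reminder of how simple the classic ideas are, here is a hand-rolled permutation importance for a random forest fit with {ranger}. This is a sketch of the general technique, not Michael's code, which uses {hstats}, {kernelshap}, and {shapviz}:

library(ranger)

set.seed(1)
fit <- ranger(mpg ~ ., data = mtcars)

# Baseline error on the data; in practice use a holdout set
rmse <- function(y, pred) sqrt(mean((y - pred)^2))
base_rmse <- rmse(mtcars$mpg, predict(fit, mtcars)$predictions)

# Permutation importance: shuffle one feature at a time and measure
# how much the prediction error increases
perm_importance <- sapply(setdiff(names(mtcars), "mpg"), function(v) {
  shuffled <- mtcars
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(mtcars$mpg, predict(fit, shuffled)$predictions) - base_rmse
})
sort(perm_importance, decreasing = TRUE)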

LOWESS Smoothing in R

Steven Sanderson had me thinking of LOESS but then, bam!, snuck this in on me:

Locally Weighted Scatterplot Smoothing, or Lowess, is a powerful technique for capturing trends in noisy data. It’s particularly useful when dealing with datasets that exhibit complex patterns that might be missed by other methods. So, let’s get our hands dirty and start coding!

Read on for an example of LOWESS smoothing, which actually is a little different from LOESS. If you’re interested in learning more about the differences between LOESS and LOWESS, this Stack Exchange question and answer page is really good.
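
For reference, base R ships lowess() (and loess() for the related method), so a minimal example needs no extra packages:

# LOWESS smoothing of MPG as a function of horsepower
plot(mtcars$hp, mtcars$mpg, pch = 19,
     xlab = "Horsepower", ylab = "MPG")
lines(lowess(mtcars$hp, mtcars$mpg, f = 2/3), col = "red", lwd = 2)

# Compare with loess(), which fits local polynomials and returns a model object
fit <- loess(mpg ~ hp, data = mtcars, span = 0.75)
ord <- order(mtcars$hp)
lines(mtcars$hp[ord], predict(fit)[ord], col = "blue", lwd = 2, lty = 2)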

Quantile Regression using Random Forests

Norm Matloff answers a reader question:

In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post.

Read on to learn more about how to do this, including some of the challenges you’ll face along the way. H/T R-Bloggers.
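
If you want to experiment outside qeML, {ranger} exposes quantile regression forests directly. A sketch under those assumptions (argument names are ranger's, not qeML's):

library(ranger)

set.seed(1)
# quantreg = TRUE keeps the per-node observations needed for quantile estimates
fit <- ranger(mpg ~ ., data = mtcars, quantreg = TRUE)

# Predict the 10th, 50th, and 90th percentiles of MPG for each car
pred <- predict(fit, mtcars, type = "quantiles",
                quantiles = c(0.1, 0.5, 0.9))
head(pred$predictions)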

Reversion to the Mean

Holger von Jouanne-Diedrich explains an important statistical concept we all too often forget:

In the realm of business and leadership, one statistical phenomenon often goes unrecognized yet significantly influences our understanding of performance and success. This is the concept of reversion to the mean (also called regression to the mean). This seemingly simple statistical occurrence can profoundly impact how we perceive management strategies, leadership effectiveness, and even the fate of those gracing the covers of prominent magazines. To understand what is going on, read on!

Read on for a video in German and an article in English, with some bonus R code to sell the story.
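
A quick simulation makes the effect easy to see: generate two independent noisy measurements of the same underlying skill and look at how the top performers on the first measurement fare on the second (a sketch to illustrate the concept, not Holger's code):

set.seed(42)
n <- 10000
skill  <- rnorm(n)             # true underlying ability
score1 <- skill + rnorm(n)     # first noisy measurement
score2 <- skill + rnorm(n)     # second, independent noisy measurement

# The top 5% on the first measurement look much less impressive on the second,
# even though nothing about their true skill has changed
top <- score1 > quantile(score1, 0.95)
mean(score1[top])   # far above average
mean(score2[top])   # noticeably closer to the mean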

Notes on Linear Markov Chains

John Mount has some thoughts for us:

I want to collect some “great things to know about linear Markov chains.”

For this note we are working with a Markov chain on states that are the integers 0 through k (k > 0). A Markov chain is an iterative random process with time tracked as an increasing integer t, and the next state of the chain depending only on the current (soon to be previous) state. For our linear Markov chain the only possible next states from state i are: i (called a “self loop” when present), i+1 (called up or right), and i-1 (called down or left). In no case does the chain progress below 0 or above k.

Click through for notes on two variants of this sort of linear Markov chain, as well as a pair of appendices containing derivation notes and Python code.
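
To make the setup concrete, here is a small R simulation of such a chain on states 0 through k (a sketch of the structure John describes, not his Python code; clamping at the boundaries is one simple convention for keeping the chain in [0, k]):

# Simulate a linear Markov chain on states 0..k: from state i the only
# moves are to i - 1, i (self loop), or i + 1
simulate_chain <- function(steps, k, start = 0,
                           p_up = 0.4, p_down = 0.4) {
  state <- start
  path <- numeric(steps + 1)
  path[1] <- state
  for (t in seq_len(steps)) {
    move <- sample(c(-1, 0, 1), size = 1,
                   prob = c(p_down, 1 - p_up - p_down, p_up))
    state <- min(max(state + move, 0), k)   # never leave [0, k]
    path[t + 1] <- state
  }
  path
}

set.seed(1)
plot(simulate_chain(200, k = 10), type = "s",
     xlab = "t", ylab = "state")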
