Data Science – Page 12

Historically, this domain has leaned on traditional statistical models, including logistic regression and decision trees. These methodologies sift through historical customer data to identify indicators predictive of future service discontinuation. Although these methods have demonstrated resilience over time, their adequacy is increasingly being questioned. In this regard, ensemble learning emerges as a sophisticated alternative, offering enhanced precision and reliability in identifying potential customer attrition.

Ensemble learning, in turn, distinguishes itself by simultaneously employing multiple predictive models to refine accuracy. This article, thus, aims to elucidate how ensemble learning can revolutionize the approach to churn prediction: we will explore various techniques such as Random Forest, Gradient Boosting, and Stacking, illustrating their efficacy in predicting customer churn through pragmatic examples.

Read on for an introduction to ensemble learning and some high-level tips to keep in mind when ensembling.

Comments closed

Bootstrapping in TidyDensity

Published 2024-01-23 by Kevin Feasel

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.

Comments closed

tidyAML Updates

Published 2024-01-19 by Kevin Feasel

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drap_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and it’s arguments.

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.

Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model generalizes to the data it was trained on.

Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.

Comments closed

An Overview of Clustering Techniques in R

Published 2024-01-12 by Kevin Feasel

Peter Laurinec gives us an overview:

Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.

Read on for a quick description of what clustering is and a few use cases. Then, Peter dives into a variety of techniques and important things you should know about them. H/T R-Bloggers.

Comments closed

The Triangular Distribution in TidyDensity

Published 2024-01-11 by Kevin Feasel

Steven Sanderson unleashes the power of the triangle:

Welcome back, fellow data enthusiasts! Today, we embark on an exciting journey into the world of statistical distributions with a special focus on the latest addition to the TidyDensity package – the triangular distribution. Tightly packed and versatile, this distribution brings a unique flavor to your data simulations and analyses. In this blog post, we’ll delve into the functions provided, understand their arguments, and explore the wonders of the triangular distribution.

Read on to learn what the triangular distribution is and how you can use work with it in TidyDensity.

Comments closed

Rolling Averages in R

Published 2024-01-08 by Kevin Feasel

Steven Sanderson performs a moving average:

Ever felt those data points were a bit too jittery? Smoothing out trends and revealing underlying patterns is a breeze with rolling averages in R. Ready to roll? Let’s dive in!

Read on to see one way to do this in R.

Comments closed

Explaining Models with Classic Methods and SHAP

Published 2024-01-08 by Kevin Feasel

Michael Mayer has some ‘splainin to do:

Let’s explain a {tidymodels} random forest by classic explainability methods (permutation importance, partial dependence plots (PDP), Friedman’s H statistics), and also fancy SHAP.

Disclaimer: {hstats}, {kernelshap} and {shapviz} are three of my own packages.

What I really appreciate in here is that Michael includes classic methods here. It can be easy to say “Oh, this is old and therefore no longer relevant.” But that would be quite wrong.

Comments closed

LOWESS Smoothing in R

Published 2024-01-05 by Kevin Feasel

Steven Sanderson had me thinking of LOESS but then, bam!, snuck this in on me:

Locally Weighted Scatterplot Smoothing, or Lowess, is a powerful technique for capturing trends in noisy data. It’s particularly useful when dealing with datasets that exhibit complex patterns that might be missed by other methods. So, let’s get our hands dirty and start coding!

Read on for an example of LOWESS smoothing, which actually is a little different from LOESS. If you’re interested in learning more about the differences between LOESS and LOWESS, this Stack Exchange question and answer page is really good.

Comments closed

Quantile Regression using Random Forests

Published 2024-01-03 by Kevin Feasel

Norm Matloff answers a reader question:

In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post.

Read on to learn more about how to do this, including some of the challenges you’ll face along the way. H/T R-Bloggers.

Comments closed

Reversion to the Mean

Published 2024-01-02 by Kevin Feasel

Holger von Jouanne-Diedrich explains an important statistical concept we all too often forget:

In the realm of business and leadership, one statistical phenomenon often goes unrecognized yet significantly influences our understanding of performance and success. This is the concept of reversion to the mean (also called regression to the mean). This seemingly simple statistical occurrence can profoundly impact how we perceive management strategies, leadership effectiveness, and even the fate of those gracing the covers of prominent magazines. To understand what is going on, read on!

Read on for a video in German and an article in English, with some bonus R code to sell the story.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data Science

Ensembling Churn Prediction Techniques

Bootstrapping in TidyDensity

tidyAML Updates

An Overview of Clustering Techniques in R

The Triangular Distribution in TidyDensity

Rolling Averages in R

Explaining Models with Classic Methods and SHAP

LOWESS Smoothing in R

Quantile Regression using Random Forests

Reversion to the Mean