Category: Data Science

Exploring the Area under the ROC Curve

Aayush Srivastava takes us through one of the classics of classification:

In the realm of machine learning classification, model evaluation is an essential step to assess the performance and effectiveness of various algorithms. One widely-used tool for this purpose is the Area Under the Receiver Operating Characteristic Curve (AUC-ROC curve). In this blog, we will delve into the significance of the AUC-ROC curve, how it is calculated, and why it is an invaluable metric for evaluating classification models.

In this article, we will discuss the performance metrics used in classification and explore the implications of using two of them, namely AUC and ROC. Here is an overview of the important points that we will discuss in the article.

The fun anecdote around ROC curves is that their name actually makes sense if you know the origin: it came out of the British army in World War II, where they tracked how their radar operators classified blips as German aircraft or noise (e.g., flocks of birds). The radar receiver operators had certain characteristics, where some were more effective at separating actual threats from noise, hence the Receiver Operating Characteristic curve.
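To make the metric concrete, here is a minimal sketch in plain Python. It is not taken from Aayush's post. It relies on one standard interpretation of AUC: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data: 1 = actual aircraft, 0 = noise; scores are model confidences.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a demonstration but not for large datasets, where a rank-based calculation is the usual choice.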

Comments closed

Where the Bayesian and Frequentist Approaches Meet

Sebastian Sauer bridges the gap:

However, a disadvantage of Bayes analysis, at least in its current state, is that it has higher technical and computational demands. For beginners in particular, this may present a substantial barrier to entry. Teaching statistics, I have found that students (and many colleagues) have had difficulties installing Stan (particularly the C++ compiler needed in order to run Stan); Stan is the probabilistic programming language that many front-end Bayes engines, such as brms in R, use.

Thus, because the installation process is not very user-friendly, beginners face a burden that may keep them from using Bayes methods.

In that light, this post explores the numerical similarities of Bayes regression models and Frequentist models. The idea is to use a Frequentist regression model as a proxy for a full Bayesian analysis. The value added is the quick computation and the simple technical setup.

Click through for the conditions where you’ll find very similar results, as well as a few examples of it in action.
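As a small illustration of the general idea (not Sebastian's own examples, which use regression models in R), here is a sketch of one textbook case in plain Python: estimating a mean with a known noise level. With a weak normal prior, the conjugate posterior lands almost exactly on the frequentist estimate, while a strong prior pulls it away. The simulated data, the prior settings, and the `posterior` helper are all assumptions made for this example.

```python
import random
import statistics

random.seed(42)
sigma = 2.0  # assumed known noise standard deviation
data = [random.gauss(5.0, sigma) for _ in range(200)]
n = len(data)

# Frequentist estimate: the sample mean and its standard error.
freq_mean = statistics.fmean(data)
freq_se = sigma / n ** 0.5


def posterior(prior_mean, prior_sd):
    """Conjugate normal-normal update for a mean with known sigma."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    data_prec = n / sigma ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * freq_mean)
    return post_mean, post_var ** 0.5


weak_mean, weak_sd = posterior(prior_mean=0.0, prior_sd=100.0)
strong_mean, strong_sd = posterior(prior_mean=0.0, prior_sd=0.1)

print("frequentist:", freq_mean, freq_se)
print("weak prior: ", weak_mean, weak_sd)
print("strong prior:", strong_mean, strong_sd)
```

The weak-prior posterior mean and standard deviation should match the frequentist mean and standard error to several decimal places, which is the kind of agreement that makes a frequentist fit usable as a quick proxy.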

Row Re-Ordering in Shiny Apps

Stephane Laurent does a bit of work:

The ‘RowReorder’ extension of datatables is available in the DT package. This extension allows you to reorder the rows of a DT table by dragging and dropping. However, if you enable this extension in a Shiny app for a table using server-side processing (option server=TRUE in renderDT), that won’t work: each time the rows are reordered, they will jump back to their original locations.

Read on to see what you need to do in that case, as well as an example of how to do it. H/T R-Bloggers.

Ensembling Churn Prediction Techniques

Salman Khan gloms together multiple trained models to solve a churn prediction problem:

Historically, this domain has leaned on traditional statistical models, including logistic regression and decision trees. These methodologies sift through historical customer data to identify indicators predictive of future service discontinuation. Although these methods have demonstrated resilience over time, their adequacy is increasingly being questioned. In this regard, ensemble learning emerges as a sophisticated alternative, offering enhanced precision and reliability in identifying potential customer attrition.

Ensemble learning, in turn, distinguishes itself by simultaneously employing multiple predictive models to refine accuracy. This article, thus, aims to elucidate how ensemble learning can revolutionize the approach to churn prediction: we will explore various techniques such as Random Forest, Gradient Boosting, and Stacking, illustrating their efficacy in predicting customer churn through pragmatic examples.

Read on for an introduction to ensemble learning and some high-level tips to keep in mind when ensembling.
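For a feel of the core mechanic, here is a minimal sketch of the simplest ensembling scheme, majority voting, in plain Python. It is a swapped-in illustration and not one of the techniques Salman walks through (Random Forest, Gradient Boosting, Stacking). The three "models" are just hypothetical lists of churn predictions, deliberately chosen so that their mistakes do not overlap.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per row by majority."""
    combined = []
    for row_votes in zip(*predictions):
        combined.append(Counter(row_votes).most_common(1)[0][0])
    return combined


def accuracy(pred, truth):
    """Fraction of rows where the prediction equals the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Hypothetical churn labels (1 = churned) from three already-trained models.
model_a = [1, 0, 1, 0, 0, 0]
model_b = [1, 1, 0, 1, 0, 0]
model_c = [0, 0, 1, 1, 0, 1]
actual = [1, 0, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
for name, pred in [("a", model_a), ("b", model_b), ("c", model_c),
                   ("ensemble", ensemble)]:
    print(name, accuracy(pred, actual))
```

In this contrived case the vote corrects every individual error. Real models tend to make correlated mistakes, so the gain is usually smaller, which is one reason diversity among ensemble members matters.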

Bootstrapping in TidyDensity

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.
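The resampling idea itself fits in a few lines. The sketch below is plain Python and does not use or reproduce TidyDensity's bootstrap_stat_plot(); the helper name `bootstrap_means`, the number of resamples, and the seed are assumptions for illustration. It resamples the 32 mpg values from mtcars with replacement and looks at the spread of the resulting averages.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=2000, seed=123):
    """Return the mean of each bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=n)  # same size as the original data
        means.append(statistics.fmean(resample))
    return means


# The mpg column of R's built-in mtcars dataset.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

means = bootstrap_means(mpg)
ordered = sorted(means)

# Simple percentile interval: the middle 95% of the bootstrap means.
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered)) - 1]

print("sample mean:", statistics.fmean(mpg))
print("95% percentile interval:", lower, upper)
```

The percentile interval used here is the most basic bootstrap interval; other variants exist and may behave better for skewed statistics.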

tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

The newest release of tidyAML adds a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model fits the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.

An Overview of Clustering Techniques in R

Peter Laurinec gives us an overview:

Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.

Read on for a quick description of what clustering is and a few use cases. Then, Peter dives into a variety of techniques and important things you should know about them. H/T R-Bloggers.

The Triangular Distribution in TidyDensity

Steven Sanderson unleashes the power of the triangle:

Welcome back, fellow data enthusiasts! Today, we embark on an exciting journey into the world of statistical distributions with a special focus on the latest addition to the TidyDensity package – the triangular distribution. Tightly packed and versatile, this distribution brings a unique flavor to your data simulations and analyses. In this blog post, we’ll delve into the functions provided, understand their arguments, and explore the wonders of the triangular distribution.

Read on to learn what the triangular distribution is and how you can work with it in TidyDensity.
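For a quick look at the distribution itself, here is a sketch in plain Python. It uses the standard library's `random.triangular(low, high, mode)` and does not call or mirror the TidyDensity functions. The `triangular_pdf` helper and the parameter values are assumptions for illustration. A triangular distribution is defined by a minimum, a maximum, and a mode, and its mean is the average of those three numbers.

```python
import random


def triangular_pdf(x, low, high, mode):
    """Density of the triangular distribution on [low, high] with given mode."""
    if x < low or x > high:
        return 0.0
    if x < mode:
        return 2.0 * (x - low) / ((high - low) * (mode - low))
    if x > mode:
        return 2.0 * (high - x) / ((high - low) * (high - mode))
    return 2.0 / (high - low)  # peak height at the mode


low, high, mode = 0.0, 10.0, 2.0
theoretical_mean = (low + high + mode) / 3.0

rng = random.Random(7)
draws = [rng.triangular(low, high, mode) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)

print("density at mode:", triangular_pdf(mode, low, high, mode))  # 0.2
print("theoretical mean:", theoretical_mean)                      # 4.0
print("sample mean:", sample_mean)
```

The sample mean should sit close to the theoretical value of 4, and every draw falls between the minimum and maximum.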

Explaining Models with Classic Methods and SHAP

Michael Mayer has some ‘splainin to do:

Let’s explain a {tidymodels} random forest by classic explainability methods (permutation importance, partial dependence plots (PDP), Friedman’s H statistics), and also fancy SHAP.

Disclaimer: {hstats}, {kernelshap} and {shapviz} are three of my own packages.

What I really appreciate here is that Michael includes classic methods. It can be easy to say “Oh, this is old and therefore no longer relevant.” But that would be quite wrong.
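To show why one of those classic methods still earns its place, here is a minimal sketch of permutation importance in plain Python. It is not Michael's code and does not use {tidymodels} or his packages. The "model" is a hypothetical stand-in function that uses only the first feature, and the data are simulated. The idea: shuffle one feature column at a time and measure how much the model's error grows.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of the model's predictions."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(model, permuted, targets) - baseline)
    return importances


# Stand-in for a fitted model: uses feature 0 only and ignores feature 1.
def model(row):
    return 3.0 * row[0]


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(100)]
targets = [3.0 * r[0] for r in rows]

importances = permutation_importance(model, rows, targets, n_features=2)
print(importances)
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature changes nothing, so its importance is exactly zero. In practice the shuffle is usually repeated several times and averaged to reduce noise.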
