
Category: Data Science

Plotting Training and Testing Results with tidyAML

Steven Sanderson builds a plot:

In the realm of machine learning, visualizing model predictions is essential for understanding the performance and behavior of our algorithms. When it comes to regression tasks, plotting predictions alongside actual values provides valuable insights into how well our model is capturing the underlying patterns in the data. With the plot_regression_predictions() function in tidyAML, this process becomes seamless and informative.

Read on to see how the function works and the kind of result you can expect from it.


Pulling Samples in R with sample()

Steven Sanderson takes a sample:

The sample() function in R is a powerful tool that allows you to generate random samples from a given dataset or vector. It’s an essential function for tasks such as data analysis, Monte Carlo simulations, and randomized experiments. In this blog post, we’ll explore the sample() function in detail and provide examples to help you understand how to use it effectively.

Read on to see what options are available with sample() and the different ways in which you can use the function.
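
For readers working outside R, the same ideas can be sketched with Python's standard library. This is a rough analog rather than a port of sample(): the population and weights below are made up for illustration, and the seed is only there to make the draws repeatable.

```python
import random

# Seeded generator so the draws are reproducible (the seed value is arbitrary).
rng = random.Random(42)

# Illustrative population: the integers 1 through 10.
population = list(range(1, 11))

# Sampling WITHOUT replacement: each element can appear at most once.
without_replacement = rng.sample(population, k=5)

# Sampling WITH replacement: the same element may be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Weighted sampling with replacement: here the value 1 is ten times
# as likely to be drawn as any other value (weights are hypothetical).
weights = [10 if x == 1 else 1 for x in population]
weighted = rng.choices(population, weights=weights, k=5)

print(without_replacement)
print(with_replacement)
print(weighted)
```

The split between random.sample and random.choices roughly mirrors the choice between sampling without and with replacement.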


Removing Skew in Data with Python

Vinod Chugani kicks the lop-sided distribution to straighten it out:

Data transformations enable data scientists to refine, normalize, and standardize raw data into a format ripe for analysis. These transformations are not merely procedural steps; they are essential in mitigating biases, handling skewed distributions, and enhancing the robustness of statistical models. This post will primarily focus on how to address skewed data. By focusing on the ‘SalePrice’ and ‘YearBuilt’ attributes from the Ames housing dataset, we will provide examples of positive and negative skewed data and illustrate ways to normalize their distributions using transformations.

Read on to see what kinds of transformations are available.
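
As a quick taste of the idea, here is a minimal sketch of one common choice, a log transformation, applied to right-skewed values. The prices are invented for illustration (they are not taken from the Ames dataset), and the skewness helper is a plain moment-based estimate that I wrote for this example.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical sale prices with a long right tail.
prices = [100_000, 120_000, 130_000, 150_000, 160_000,
          180_000, 250_000, 400_000, 750_000]

# The log transform compresses large values more than small ones,
# which pulls in the right tail.
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

print(f"skewness before: {raw_skew:.2f}")
print(f"skewness after:  {log_skew:.2f}")
```

The log transform only works for strictly positive values, so it suits prices but would need adjusting for data containing zeros or negatives.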


Classical Methods for Outlier Detection

Vinod Chugani is speaking my language:

Outliers are unique in that they often don’t play by the rules. These data points, which significantly differ from the rest, can skew your analyses and make your predictive models less accurate. Although detecting outliers is critical, there is no universally agreed-upon method for doing so. While some advanced techniques like machine learning offer solutions, in this post, we will focus on the foundational Data Science methods that have been in use for decades.

Vinod looks at a few techniques, including inter-quartile range and comparing results to an expected distribution. If you’re really excited about this topic, I know a guy who’s written a bit about it.
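
To make the inter-quartile range approach concrete, here is a minimal sketch using Python's standard library. The data and the conventional 1.5 multiplier are my assumptions for illustration, not values from Vinod's post.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up measurements with one obviously unusual value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14]

outliers = iqr_outliers(data)
print(outliers)
```

Because quartiles are not pulled around by extreme values, the fences stay sensible even when the outlier itself is very large.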


Exploring the Area under the ROC Curve

Aayush Srivastava takes us through one of the classics of classification:

In the realm of machine learning classification, model evaluation is an essential step to assess the performance and effectiveness of various algorithms. One widely-used tool for this purpose is the Area Under the Receiver Operating Characteristic Curve (AUC-ROC curve). In this blog, we will delve into the significance of the AUC-ROC curve, how it is calculated, and why it is an invaluable metric for evaluating classification models.

In this article, we will discuss the performance metrics used in classification and explore the implications of using two of them, namely AUC and ROC. Here is an overview of the important points that we will discuss in the article.

The fun anecdote around ROC curves is that their name actually makes sense if you know the origin: it came out of the British army in World War II, where they tracked how their radar operators classified blips as German aircraft or noise (e.g., flocks of birds). The radar receiver operators had certain characteristics, where some were more effective at separating actual threats from noise, hence the Receiver Operating Characteristic curve.
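
One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way. The function name, labels, and scores are mine, chosen for illustration, and the pairwise loop favors clarity over speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Illustrative labels (1 = threat, 0 = noise) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)
```

A score of 0.5 corresponds to random ranking and 1.0 to a classifier that ranks every positive above every negative.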


Where the Bayesian and Frequentist Approaches Meet

Sebastian Sauer bridges the gap:

However, a disadvantage of Bayes analysis, at least in its current state, is that it has higher technical and computational demands. For beginners in particular, this may present a substantial (entry) burden. Teaching statistics, I have found that students (and many colleagues) have had difficulties installing Stan (particularly the C++ compiler needed in order to run Stan); Stan is the probabilistic programming language that many front-end Bayes engines, such as brms in R, use.

Thus, because the installation process is not very user-friendly, beginners face a burden that may prevent them from using Bayes methods.

In that light, this post explores the numerical similarities of Bayes regression models and Frequentist models. The idea is to use a Frequentist regression model as a proxy for a full Bayesian analysis. The value added is the quick computation and the simple technical setup.

Click through for the conditions where you’ll find very similar results, as well as a few examples of it in action.
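
A toy case shows why the two approaches can agree numerically. This is not Sebastian's regression setup, just a textbook conjugate example I am assuming for illustration: estimating a normal mean with known variance. With a very wide prior, the posterior mean is almost identical to the sample mean, while a tight prior pulls it away. All numbers below are made up.

```python
import statistics


def posterior_mean(data, sigma2, prior_mean, prior_var):
    """Posterior mean of a normal mean with known variance sigma2
    under a conjugate Normal(prior_mean, prior_var) prior."""
    n = len(data)
    xbar = statistics.fmean(data)
    # Precisions (inverse variances) add; the posterior mean is the
    # precision-weighted average of the prior mean and the sample mean.
    precision = 1 / prior_var + n / sigma2
    return (prior_mean / prior_var + n * xbar / sigma2) / precision


# Illustrative measurements.
data = [4.8, 5.1, 5.3, 4.9, 5.2, 5.0]

# Frequentist point estimate: the sample mean.
freq = statistics.fmean(data)

# Very wide prior: the data dominate the posterior.
bayes_vague = posterior_mean(data, sigma2=0.04, prior_mean=0.0, prior_var=1e6)

# Very tight prior centered at 0: the prior dominates instead.
bayes_tight = posterior_mean(data, sigma2=0.04, prior_mean=0.0, prior_var=0.001)

print(freq, bayes_vague, bayes_tight)
```

The gap between the two estimates shrinks as the prior gets wider or the sample gets larger, which is the general intuition behind using one as a stand-in for the other.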


Row Re-Ordering in Shiny Apps

Stephane Laurent does a bit of work:

The ‘RowReorder’ extension of datatables is available in the DT package. This extension lets you reorder the rows of a DT table by dragging and dropping. However, if you enable this extension in a Shiny app for a table using server-side processing (option server=TRUE in renderDT), that won’t work: each time the rows are reordered, they will jump back to their original locations.

Read on to see what you need to do in that case, as well as an example of how to do it. H/T R-Bloggers.


Ensembling Churn Prediction Techniques

Salman Khan gloms together multiple trained models to solve a churn prediction problem:

Historically, this domain has leaned on traditional statistical models, including logistic regression and decision trees. These methodologies sift through historical customer data to identify indicators predictive of future service discontinuation. Although these methods have demonstrated resilience over time, their adequacy is increasingly being questioned. In this regard, ensemble learning emerges as a sophisticated alternative, offering enhanced precision and reliability in identifying potential customer attrition.

Ensemble learning, in turn, distinguishes itself by simultaneously employing multiple predictive models to refine accuracy. This article, thus, aims to elucidate how ensemble learning can revolutionize the approach to churn prediction: we will explore various techniques such as Random Forest, Gradient Boosting, and Stacking, illustrating their efficacy in predicting customer churn through pragmatic examples.

Read on for an introduction to ensemble learning and some high-level tips to keep in mind when ensembling.
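
The simplest form of ensembling is hard (majority) voting, which is a different technique from the Random Forest, Gradient Boosting, and Stacking approaches the article covers but shows the core idea of combining several models. In the sketch below, the three "models" are hypothetical hand-written rules standing in for trained classifiers, and the customer fields are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained models: each returns True for "will churn".
def by_tenure(customer):
    return customer["tenure_months"] < 6


def by_tickets(customer):
    return customer["support_tickets"] >= 3


def by_usage(customer):
    return customer["monthly_logins"] < 4


# An odd number of voters avoids ties for a binary outcome.
MODELS = [by_tenure, by_tickets, by_usage]


def ensemble_predict(customer):
    return majority_vote([model(customer) for model in MODELS])


# Made-up customers.
new_and_unhappy = {"tenure_months": 3, "support_tickets": 4, "monthly_logins": 10}
long_time_quiet = {"tenure_months": 24, "support_tickets": 0, "monthly_logins": 2}

print(ensemble_predict(new_and_unhappy))
print(ensemble_predict(long_time_quiet))
```

Voting only helps when the individual models make different mistakes, which is one reason diversity among the base models matters.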


Bootstrapping in TidyDensity

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.
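
The resampling idea itself fits in a few lines. Here is a minimal sketch in Python rather than TidyDensity: it resamples with replacement, recomputes the mean each time, and reads off a percentile interval. The MPG values are invented for illustration (they are not the mtcars data), and the number of resamples and the seed are arbitrary choices.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=2000, seed=123):
    """Means of n_resamples samples drawn with replacement from values."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n))
            for _ in range(n_resamples)]


# Hypothetical miles-per-gallon readings.
mpg = [21.0, 22.8, 18.7, 14.3, 24.4, 19.2, 17.8, 16.4,
       15.2, 10.4, 32.4, 30.4, 21.5, 15.5, 27.3, 26.0]

means = sorted(bootstrap_means(mpg))

# Percentile interval: the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean: {statistics.fmean(mpg):.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

The width of the interval gives a sense of how much the average could move if the data had come out a little differently.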


tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model fits the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.
