Press "Enter" to skip to content

Category: Data Science

Quantile Normalization in R

Steven Sanderson has achieved normality:

Before we dive into the code, let’s understand the concept behind quantile normalization. At its core, quantile normalization aims to equalize the distributions of multiple datasets by aligning their quantiles. This ensures that each dataset has the same distribution of values, making meaningful comparisons possible.

This is a bit different from normalizing individual data points in one dataset, as you can see in the post.
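
Purely as a back-of-the-napkin illustration of the idea (my own sketch, not necessarily the approach in the post), quantile normalization of a matrix of columns can be written in a few lines of base R:

# Minimal base-R sketch of quantile normalization (not the code from the post):
# each column is forced onto the same distribution -- the mean of the sorted columns.
set.seed(123)

mat <- cbind(
  sample1 = rexp(100, rate = 1),          # toy datasets with different distributions
  sample2 = rnorm(100, mean = 5),
  sample3 = runif(100, min = 0, max = 10)
)

quantile_normalize <- function(x) {
  ranks      <- apply(x, 2, rank, ties.method = "min")    # rank within each column
  sorted     <- apply(x, 2, sort)                         # sort each column
  ref_dist   <- rowMeans(sorted)                          # the shared reference distribution
  normalized <- apply(ranks, 2, function(r) ref_dist[r])  # map each rank onto the reference
  normalized
}

norm_mat <- quantile_normalize(mat)
apply(norm_mat, 2, quantile)   # the columns now share (nearly) identical quantiles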

A Bayesian Approach to CAPTCHAs

John Cook claims to be human:

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

Read on to see how a Bayesian approach to the problem could make things a bit less annoying.
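
As a rough illustration of the flavor such an approach could take (my own sketch with made-up numbers, not John's actual model): treat each answer as Bernoulli evidence about whether the respondent is human and update the posterior after every clip, stopping once it clears a threshold instead of demanding a perfect 20 of 20.

# Rough sketch (my assumptions, not John Cook's model): update the probability
# that a respondent is human after each audio test, and stop early once we are
# confident enough either way.
p_correct_human <- 0.90   # assumed chance a human gets a single clip right
p_correct_bot   <- 0.35   # assumed chance a bot gets it right
prior_human     <- 0.50   # prior probability the respondent is human

update_posterior <- function(prior, answered_correctly) {
  like_human <- if (answered_correctly) p_correct_human else 1 - p_correct_human
  like_bot   <- if (answered_correctly) p_correct_bot   else 1 - p_correct_bot
  prior * like_human / (prior * like_human + (1 - prior) * like_bot)
}

# Example: eight answers, one of them wrong
answers   <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
posterior <- Reduce(update_posterior, answers, init = prior_human, accumulate = TRUE)
round(posterior, 3)   # stop as soon as the posterior exceeds, say, 0.99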

Plotting Training and Testing Results with tidyAML

Steven Sanderson builds a plot:

In the realm of machine learning, visualizing model predictions is essential for understanding the performance and behavior of our algorithms. When it comes to regression tasks, plotting predictions alongside actual values provides valuable insights into how well our model is capturing the underlying patterns in the data. With the plot_regression_predictions() function in tidyAML, this process becomes seamless and informative.

Read on to see how the function works and the kind of result you can expect from it.
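
I haven't reproduced the tidyAML call itself here; as a stand-in for the kind of picture the post describes, here is a plain ggplot2 sketch of predictions plotted against actual values for an arbitrary regression model (my example, not the plot_regression_predictions() output):

# Plain ggplot2 sketch of "predicted vs. actual" for a regression model --
# a stand-in for the idea, not the tidyAML plot_regression_predictions() output.
library(ggplot2)

fit <- lm(mpg ~ wt + hp, data = mtcars)           # any regression model will do
df  <- data.frame(actual    = mtcars$mpg,
                  predicted = predict(fit, mtcars))

ggplot(df, aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # perfect-fit line
  labs(title = "Predicted vs. actual values",
       x = "Actual", y = "Predicted")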

Pulling Samples in R with sample()

Steven Sanderson takes a sample:

The sample() function in R is a powerful tool that allows you to generate random samples from a given dataset or vector. It’s an essential function for tasks such as data analysis, Monte Carlo simulations, and randomized experiments. In this blog post, we’ll explore the sample() function in detail and provide examples to help you understand how to use it effectively.

Read on to see what options are available with sample() and the different ways in which you can use the function.
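
As a quick reference, a few of the common forms look like this (a sketch of standard usage, not code from the post):

set.seed(42)

sample(1:10)                              # permute a vector (sampling without replacement)
sample(1:10, size = 3)                    # draw 3 values without replacement
sample(1:10, size = 15, replace = TRUE)   # with replacement, so size can exceed the length
sample(c("win", "lose"), size = 5, replace = TRUE,
       prob = c(0.7, 0.3))                # weighted sampling via the prob argument

# Sampling rows from a data frame is a common pattern:
mtcars[sample(nrow(mtcars), size = 5), ]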

Removing Skew in Data with Python

Vinod Chugani kicks the lop-sided distribution to straighten it out:

Data transformations enable data scientists to refine, normalize, and standardize raw data into a format ripe for analysis. These transformations are not merely procedural steps; they are essential in mitigating biases, handling skewed distributions, and enhancing the robustness of statistical models. This post will primarily focus on how to address skewed data. By focusing on the ‘SalePrice’ and ‘YearBuilt’ attributes from the Ames housing dataset, we will provide examples of positive and negative skewed data and illustrate ways to normalize their distributions using transformations.

Read on to see what kinds of transformations are available.
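
The post works in Python against the Ames housing data; to keep with the R flavor of this roundup, here is the same general idea sketched in base R on made-up data, using a log transform for positive skew and a reflect-then-log transform for negative skew:

# General idea in base R (the post itself uses Python and the Ames housing data):
# log-type transforms pull in a long right tail; reflecting and then logging
# addresses a long left tail.
set.seed(1)

right_skewed <- rlnorm(1000, meanlog = 11, sdlog = 0.5)   # stand-in for a SalePrice-like variable
left_skewed  <- 2000 - rexp(1000, rate = 1/30)            # stand-in for a YearBuilt-like variable

log_transformed <- log1p(right_skewed)                            # positive skew: log(1 + x)
reflect_logged  <- log1p(max(left_skewed) + 1 - left_skewed)      # negative skew: reflect, then log

# A quick check that the transforms helped (magnitudes nearer 0 mean less skew)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(before = skewness(right_skewed), after = skewness(log_transformed))
c(before = skewness(left_skewed),  after = skewness(reflect_logged))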

Classical Methods for Outlier Detection

Vinod Chugani is speaking my language:

Outliers are unique in that they often don’t play by the rules. These data points, which significantly differ from the rest, can skew your analyses and make your predictive models less accurate. Although detecting outliers is critical, there is no universally agreed-upon method for doing so. While some advanced techniques like machine learning offer solutions, in this post, we will focus on the foundational Data Science methods that have been in use for decades.

Vinod looks at a few techniques, including inter-quartile range and comparing results to an expected distribution. If you’re really excited about this topic, I know a guy who’s written a bit about it.
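
For reference, the inter-quartile range rule and a z-score check are both a couple of lines in base R (a sketch on toy data, not Vinod's code):

# The classic 1.5 * IQR fence, plus a z-score check against a normal distribution.
x <- c(rnorm(200, mean = 50, sd = 5), 120, -10)   # toy data with two planted outliers

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1
iqr_outliers <- x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]

z <- (x - mean(x)) / sd(x)
z_outliers <- x[abs(z) > 3]                       # flag points more than 3 SDs from the mean

iqr_outliers
z_outliers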

Exploring the Area under the ROC Curve

Aayush Srivastava takes us through one of the classics of classification:

In the realm of machine learning classification, model evaluation is an essential step to assess the performance and effectiveness of various algorithms. One widely-used tool for this purpose is the Area Under the Receiver Operating Characteristic Curve (AUC-ROC curve). In this blog, we will delve into the significance of the AUC-ROC curve, how it is calculated, and why it is an invaluable metric for evaluating classification models.

In this article, we will discuss the performance metrics used in classification and also explore the implications of using two of them, namely AUC and ROC. Here is an overview of the important points that we will discuss in the article.

The fun anecdote around ROC curves is that their name actually makes sense if you know the origin: it came out of the British army in World War II, where they tracked how their radar operators classified blips as German aircraft or noise (e.g., flocks of birds). The radar receiver operators had certain characteristics, where some were more effective at separating actual threats from noise, hence the Receiver Operating Characteristic curve.
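
To make the metric concrete, the curve can be traced by sweeping the classification threshold and the AUC approximated with the trapezoid rule; here is a small base-R sketch on simulated scores (mine, not from the article):

# Base-R sketch of an ROC curve and its AUC: sweep the threshold, record the
# true positive rate and false positive rate, then integrate with the trapezoid rule.
set.seed(7)
labels <- rbinom(500, 1, 0.4)                                 # true classes
scores <- ifelse(labels == 1, rnorm(500, 1), rnorm(500, 0))   # an imperfect classifier

thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# Trapezoid-rule AUC over the (FPR, TPR) points, padded with (0,0) and (1,1)
fpr_full <- c(0, fpr, 1)
tpr_full <- c(0, tpr, 1)
auc <- sum(diff(fpr_full) * (head(tpr_full, -1) + tail(tpr_full, -1)) / 2)

plot(fpr_full, tpr_full, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = sprintf("ROC curve (AUC = %.3f)", auc))
abline(0, 1, lty = 2)   # the no-skill diagonal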

Where the Bayesian and Frequentist Approaches Meet

Sebastian Sauer bridges the gap:

However, a disadvantage of Bayes analysis, at least in its current state, is that it has higher technical and computational demands. For beginners in particular, this may present a substantial (entry) burden. Teaching statistics, I have found that students (and many colleagues) have had difficulties installing Stan (particularly the C++ compiler needed in order to run Stan); Stan is the probabilistic programming language that many front-end Bayes engines, such as brms in R, build upon.

Thus, because the installation process is not so user-friendly, a burden is placed on beginners that may prevent them from using Bayes methods.

In that light, this post explores the numerical similarities between Bayes regression models and Frequentist models. The idea is to use a Frequentist regression model as a proxy for a full Bayesian analysis. The value added is the quick computation and the simple technical setup.

Click through for the conditions where you’ll find very similar results, as well as a few examples of it in action.
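
As a trivial illustration of the proxy idea (my example, not Sebastian's code): with vague priors and a reasonable amount of data, the posterior means from a Bayesian regression sit close to the maximum-likelihood estimates, so lm() output can be read as an approximation to a flat-prior Bayesian fit.

# Sketch of the "frequentist fit as a proxy" idea: under flat (or very weak)
# priors and enough data, the Bayesian posterior mean is close to the maximum-
# likelihood estimate, so lm() output approximates a flat-prior Bayesian regression.
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)      # point estimates ~ posterior means under flat priors
confint(fit)   # 95% confidence intervals ~ 95% credible intervals in that setting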

Row Re-Ordering in Shiny Apps

Stephane Laurent does a bit of work:

The ‘RowReorder’ extension of datatables is available in the DT package. This extension allows you to reorder the rows of a DT table by dragging and dropping. However, if you enable this extension in a Shiny app for a table that uses server-side processing (option server=TRUE in renderDT), it won’t work: each time the rows are reordered, they will jump back to their original locations.

Read on to see what you need to do in that case, as well as an example of how to do it. H/T R-Bloggers.
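
For context, a minimal version of the setup Stephane describes might look like the following (my sketch): the extension is enabled, but with server-side processing the new row order does not persist, which is exactly the problem the post solves.

# Minimal Shiny setup along the lines Stephane describes (my sketch): the
# RowReorder extension is enabled, but with server-side processing the reordering
# does not stick -- see the post for the workaround.
library(shiny)
library(DT)

ui <- fluidPage(
  DTOutput("tbl")
)

server <- function(input, output, session) {
  output$tbl <- renderDT(
    datatable(
      iris[1:10, ],
      extensions = "RowReorder",
      options = list(rowReorder = TRUE, order = list(list(0, "asc")))
    ),
    server = TRUE   # server-side processing: rows snap back after a drag
  )
}

shinyApp(ui, server)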

Ensembling Churn Prediction Techniques

Salman Khan gloms together multiple trained models to solve a churn prediction problem:

Historically, this domain has leaned on traditional statistical models, including logistic regression and decision trees. These methodologies sift through historical customer data to identify indicators predictive of future service discontinuation. Although these methods have demonstrated resilience over time, their adequacy is increasingly being questioned. In this regard, ensemble learning emerges as a sophisticated alternative, offering enhanced precision and reliability in identifying potential customer attrition.

Ensemble learning, in turn, distinguishes itself by simultaneously employing multiple predictive models to refine accuracy. This article, thus, aims to elucidate how ensemble learning can revolutionize the approach to churn prediction: we will explore various techniques such as Random Forest, Gradient Boosting, and Stacking, illustrating their efficacy in predicting customer churn through pragmatic examples.

Read on for an introduction to ensemble learning and some high-level tips to keep in mind when ensembling.
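
As a bare-bones illustration of the idea (not code from the article), here is a tiny soft-voting ensemble in R that averages churn probabilities from a logistic regression and a random forest on simulated data:

# Bare-bones soft-voting ensemble sketch on simulated churn data:
# average the predicted churn probabilities from two different models.
library(randomForest)
set.seed(2024)

# Toy churn data: tenure, monthly charges, and a churn flag
n  <- 1000
df <- data.frame(
  tenure  = rpois(n, 24),
  charges = rnorm(n, 70, 20)
)
df$churn <- factor(rbinom(n, 1, plogis(-0.05 * df$tenure + 0.02 * df$charges - 1)))

train_idx <- sample(n, 0.7 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

glm_fit <- glm(churn ~ tenure + charges, data = train, family = binomial)
rf_fit  <- randomForest(churn ~ tenure + charges, data = train)

glm_prob <- predict(glm_fit, test, type = "response")
rf_prob  <- predict(rf_fit, test, type = "prob")[, "1"]

ensemble_prob <- (glm_prob + rf_prob) / 2          # simple average = soft voting
ensemble_pred <- ifelse(ensemble_prob > 0.5, "1", "0")

mean(ensemble_pred == test$churn)                  # accuracy of the blended model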
