Press "Enter" to skip to content

Category: Data Science

An Overview of Cross-Validation

Vinod Chugani explains the benefit of cross-validation in a data science project:

Many beginners will initially rely on the train-test method to evaluate their models. This method is straightforward and seems to give a clear indication of how well a model performs on unseen data. However, this approach can often lead to an incomplete understanding of a model’s capabilities. In this blog, we’ll discuss why it’s important to go beyond the basic train-test split and how cross-validation can offer a more thorough evaluation of model performance. Join us as we guide you through the essential steps to achieve a deeper and more accurate assessment of your machine learning models.

Click through for the full article.
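
As a quick sketch of the idea, here's what k-fold cross-validation looks like in base R; the dataset, model, and fold count are illustrative rather than from the article. Every observation lands in a test fold exactly once, and averaging the per-fold scores gives a more stable estimate than a single train-test split.

```r
# A minimal 5-fold cross-validation sketch in base R (illustrative example)
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]                # everything outside fold i
  test  <- mtcars[folds == i, ]                # fold i held out for testing
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))             # out-of-fold RMSE
})

mean(fold_rmse)  # average error across all k folds
```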

Tips for Hyperparameter Tuning

Bala Priya C shares some tips and techniques:

If you’re familiar with machine learning, you know that the training process allows the model to learn the optimal values for the parameters—or model coefficients—that characterize it. But machine learning models also have a set of hyperparameters whose values you should specify when training the model. So how do you find the optimal values for these hyperparameters?

You can use hyperparameter tuning to find the best values for the hyperparameters. By systematically adjusting hyperparameters, you can optimize your models to achieve the best possible results.

This tutorial provides practical tips for effective hyperparameter tuning—starting from building a baseline model to using advanced techniques like Bayesian optimization. Whether you’re new to hyperparameter tuning or looking to refine your approach, these tips will help you build better machine learning models. Let’s get started.

Read on for those techniques. Incidentally, one of my “Old man yells at clouds” takes is that I dislike the existence of hyperparameters and consider them a modeling failure: they essentially tell the implementer to do part of the researcher’s work. Knowing that they are necessary for so many algorithms, there’s nothing to do but learn how to work with them effectively, but the process has a feel of outsourcing the hard work to users that I don’t like. For that reason, I have extra respect for algorithms that neither need nor offer hyperparameters.
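
To make the idea concrete, here's a minimal grid-search sketch in base R; the "hyperparameter" here is the polynomial degree of a regression model, and the dataset and grid are illustrative rather than from the tutorial.

```r
# Grid search over one hyperparameter against a held-out validation set
set.seed(42)
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

grid <- 1:4  # candidate hyperparameter values (polynomial degree)
rmse <- sapply(grid, function(d) {
  fit <- lm(mpg ~ poly(wt, d), data = train)
  sqrt(mean((valid$mpg - predict(fit, valid))^2))  # validation error
})

grid[which.min(rmse)]  # the degree with the lowest validation RMSE
```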

Systematic Sampling in R

Steven Sanderson continues a series on sampling:

In this post, we will explore systematic sampling in R using base R functions. Systematic sampling is a technique where you select every k-th element from a list or dataset. This method is straightforward and useful when you want a representative sample without the complexity of more advanced sampling techniques.

Let’s dive into an example to understand how it works.

In very technical circles, this is also known as the “eenie-meenie-meiney-moe technique” and is very similar to the “duck-duck-goose” algorithm, though that has an additional stochastic input.
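
For a sense of how little code this takes, here's a minimal sketch in base R; the population size and interval are illustrative. Pick a random starting point within the first interval, then take every k-th element from there.

```r
# Systematic sampling: random start, then every k-th element (illustrative)
set.seed(42)
population <- 1:1000
k     <- 10                       # sampling interval
start <- sample(1:k, 1)           # random start within the first interval
systematic_sample <- population[seq(start, length(population), by = k)]
length(systematic_sample)         # roughly N / k elements
```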

Cluster Sampling in R

Steven Sanderson shows us one sampling technique:

Cluster sampling is a useful technique when dealing with large datasets spread across different groups or clusters. It involves dividing the population into clusters, randomly selecting some clusters, and then sampling all or some members from these selected clusters. This method can save time and resources compared to simple random sampling.

In this post, we’ll walk through how to perform cluster sampling in R. We’ll use a sample dataset and break down the code step-by-step. By the end, you’ll have a clear understanding of how to implement cluster sampling in your projects.

Read on for the scenario and sample code.
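
As a rough base-R sketch of the technique (the data frame and cluster counts here are made up, not Steven's), you randomly choose whole clusters and then keep every member of the chosen clusters:

```r
# Cluster sampling: select whole clusters at random, keep all of their members
set.seed(42)
df <- data.frame(
  id      = 1:200,
  cluster = rep(1:20, each = 10),  # 20 clusters of 10 members each
  value   = rnorm(200)
)

chosen <- sample(unique(df$cluster), 5)        # randomly pick 5 clusters
cluster_sample <- df[df$cluster %in% chosen, ]
table(cluster_sample$cluster)                  # every member of each chosen cluster
```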

Stratified Sampling in R

Steven Sanderson builds a sample:

Stratified sampling is a technique used to ensure that different subgroups (strata) within a population are represented in a sample. This method is particularly useful when certain strata are underrepresented in a simple random sample. In this post, we’ll explore how to perform stratified sampling in R using both base R and the dplyr package. We’ll walk through examples and explain the code, so you can try these techniques on your own data.

Click through to see how.
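
Here's a minimal base-R sketch of the idea, using the iris dataset and a 20% fraction as stand-ins: sample the same proportion within each stratum so every subgroup shows up in the result.

```r
# Stratified sampling: sample a fixed proportion within each stratum
set.seed(42)
strata  <- split(iris, iris$Species)    # one data frame per stratum
sampled <- lapply(strata, function(s) {
  s[sample(nrow(s), ceiling(0.2 * nrow(s))), ]  # 20% from each stratum
})
stratified_sample <- do.call(rbind, sampled)
table(stratified_sample$Species)        # equal representation per stratum
```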

AutoML in Python with TPOT

Abid Ali Awan gives us a primer on TPOT:

AutoML is a tool designed for both technical and non-technical experts. It simplifies the process of training machine learning models. All you have to do is provide it with the dataset, and in return, it will provide you with the best-performing model for your use case. You don’t have to code for long hours or experiment with various techniques; it will do everything on its own for you.

In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn to build a machine learning classifier, save the model, and use it for model inference.

Click through to see an example of how to use the library.
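
TPOT itself is a Python library, so this isn't its API, but the core AutoML idea translates to a toy base-R sketch: automatically try several candidate model specifications and keep whichever scores best on held-out data. Everything here is illustrative.

```r
# A toy "AutoML" loop: evaluate candidate specifications, keep the winner
set.seed(42)
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

candidates <- list(                      # the search space of model specs
  simple      = mpg ~ wt,
  two_feature = mpg ~ wt + hp,
  interaction = mpg ~ wt * hp
)

scores <- sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  sqrt(mean((valid$mpg - predict(fit, valid))^2))  # validation RMSE
})

names(which.min(scores))  # the best-performing specification
```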

Changing Distributions and Simpson’s Paradox

Jerry Tuttle describes a paradox:

So you spent hours, or maybe days, cranking out thousands of numbers, you submit it to your boss just at the deadline, your boss quickly peruses your exhibit of numbers, points to a single number and says, “This number doesn’t look right.” Bosses have an uncanny ability to do this.

Your boss is pointing to something like this: Your company sells property insurance on both personal and commercial properties. The average personal property premium increased 10% in 2024. The average commercial property premium increased 10% in 2024. But you say the combined average property premium decreased 3% in 2024. You realize that negative 3% does not look right.

Although the blog post doesn’t explicitly mention Simpson’s paradox, I’d argue that this is a good example of the idea. H/T R-Bloggers.
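
A made-up numeric sketch shows how the arithmetic works: both segments' average premiums rise 10%, but the book of business shifts toward cheaper personal policies, so the combined average falls about 3%. All figures below are illustrative, not from the post.

```r
# Both segments rise 10%, yet the mix shift drags the combined average down
avg_2023 <- c(personal = 1000, commercial = 5000)   # average premiums
n_2023   <- c(personal = 1000, commercial = 1000)   # policy counts

avg_2024 <- avg_2023 * 1.10                         # +10% in each segment
n_2024   <- c(personal = 1177, commercial = 823)    # mix shifts toward personal

combined_2023 <- sum(avg_2023 * n_2023) / sum(n_2023)  # 3000
combined_2024 <- sum(avg_2024 * n_2024) / sum(n_2024)  # about 2911

combined_2024 / combined_2023 - 1  # roughly -0.03: a 3% decline
```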

Using the fast_regression() Method in tidyAML

Steven Sanderson says, It’s my regression and I want it NOW:

If you’ve ever faced the daunting task of setting up multiple regression models in R, you’ll appreciate the convenience and efficiency that tidyAML brings to the table. Today, we’re diving into one of its standout functions: fast_regression(). This function is designed to streamline the regression modeling process, allowing you to quickly create and evaluate a variety of model specifications with minimal code.

Read on to see how the function works.
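
For flavor, here's roughly what the workflow looks like; I'm writing the argument names from memory, so treat this as an illustrative sketch and check the tidyAML documentation for the current API.

```r
# Illustrative sketch of fast_regression(); argument names may differ from
# the current tidyAML release
library(tidyAML)
library(recipes)

rec_obj <- recipe(mpg ~ ., data = mtcars)   # preprocessing recipe

# Fit several regression engines in one call, getting back a tibble of
# specs, fitted models, and predictions
frt_tbl <- fast_regression(
  .data        = mtcars,
  .rec_obj     = rec_obj,
  .parsnip_fns = "linear_reg",
  .parsnip_eng = c("lm", "glm")
)

frt_tbl
```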

Transferring Linear Model Coefficients

Nina Zumel performs a swap:

A quick glance through the scikit-learn documentation on linear models, or the CRAN task view on Mixed, Multilevel, and Hierarchical Models in R, reveals a number of different procedures for fitting models with linear structure. Each of these procedures meets different needs and constraints, and some of them can be computationally intensive. But in the end, they all have the same underlying structure: outcome is modelled as a linear combination of input features.

But the existence of so many different algorithms, and their associated software, can obscure the fact that two models fit by different procedures don’t have to be run differently. The fitting implementation and the deployment implementation can be distinct. In this note, we’ll talk about transferring the coefficients of a linear model to a fresh model, without a full retraining.

I had a similar problem about 18 months ago, though a much easier one than Nina describes: I did have access to the original data and simply needed to build a linear regression in Python that exactly matched the one developed in R. It turns out that’s not as easy as you might think: the two languages have different default assumptions that make the results similar but not the same, and piecing it all together took a bit of sleuthing.
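
The underlying point is easy to see in a sketch: once you have the coefficients, deployment is just a matrix product, independent of how the model was fit. The example below uses lm() and mtcars as stand-ins for whatever procedure produced the coefficients.

```r
# Coefficients are all you need to "run" a linear model
fit   <- lm(mpg ~ wt + hp, data = mtcars)
betas <- coef(fit)                            # intercept + feature coefficients

X <- model.matrix(~ wt + hp, data = mtcars)   # design matrix with intercept column
manual_preds <- as.vector(X %*% betas)        # prediction as pure linear algebra

all.equal(manual_preds, unname(predict(fit, mtcars)))  # TRUE
```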

A/B Testing with Survival Analysis in R

Iyar Lin combines two great flavors:

Usually when running an A/B test, analysts assign users randomly to variants over time and measure conversion rate as the ratio between the number of conversions and the number of users in each variant. Users who just entered the test and those who have been in it for two weeks get the same weight.

This can be enough for cases where a conversion either happens or not within a short time frame after assignment to a variant (e.g., finishing an onboarding flow).

There are, however, many instances where conversions are spread over a longer time frame. One example would be first order after visiting a site landing page. Such conversions may happen within minutes, but a large chunk could also happen within days after the first visit.

Read on for the scenario, as well as a simulation. I will note that, in the digital marketing industry, there’s usually a hard cap on the number of days during which you’re able to attribute a conversion to some action, for exactly the reason Iyar mentions. H/T R-Bloggers.
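
As a generic sketch of the approach (simulated data, not Iyar's scenario), you can treat time to conversion as a right-censored survival outcome and compare variants with a Kaplan-Meier estimator and a log-rank test:

```r
# Comparing time-to-conversion across A/B variants with survival analysis
library(survival)

set.seed(42)
n <- 500
variant <- rep(c("A", "B"), each = n)
# Variant B converts a bit faster; anyone not converted by day 30 is censored
time_to_convert <- c(rexp(n, rate = 0.05), rexp(n, rate = 0.07))
observed  <- pmin(time_to_convert, 30)
converted <- as.integer(time_to_convert <= 30)

km <- survfit(Surv(observed, converted) ~ variant)
summary(km, times = c(7, 14, 30))              # conversion curves by variant
survdiff(Surv(observed, converted) ~ variant)  # log-rank test for a difference
```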
