Press "Enter" to skip to content

Category: Data Science

A Primer on One-Hot Encoding

Vinod Chugani does a bit of data modeling:

Preparing categorical data correctly is a fundamental step in machine learning, particularly when using linear models. One Hot Encoding stands out as a key technique, enabling the transformation of categorical variables into a machine-understandable format. This post tells you why you cannot use a categorical variable directly and demonstrates the use One Hot Encoding in our search for identifying the most predictive categorical features for linear regression.

Read the whole thing.

Comments closed

Analyzing the Game Wingspan

Dan Oehm builds a meta:

Wingspan is a great game even though I’ve only played it a few times. The mechanics are great, there are lots of bird varitions, and a bunch of different strategies to try. There are 170 birds, and I’ve probably only seen 30 of them. So, true to form, I’ve dabbled in a bit of data analysis to get a view of all the different types of cards in the game.

Open source wins again since the {wingspan} R package exists. It contains the details of each bird in the core, European, Oceania, and swift start sets. I’ll only be using the core set for this analysis since that’s the only one I’m semi familiar with.

Having not played the game before, Dan’s visuals drew me in. There’s also a regression analysis and discussion of the trade-off between in-game power versus victory points. H/T R-Bloggers.

Comments closed

An Overview of Cross-Validation

Vinod Chugani explains the benefit of cross-validation in a data science project:

Many beginners will initially rely on the train-test method to evaluate their models. This method is straightforward and seems to give a clear indication of how well a model performs on unseen data. However, this approach can often lead to an incomplete understanding of a model’s capabilities. In this blog, we’ll discuss why it’s important to go beyond the basic train-test split and how cross-validation can offer a more thorough evaluation of model performance. Join us as we guide you through the essential steps to achieve a deeper and more accurate assessment of your machine learning models.

Click through for the full article.

Comments closed

Tips for Hyperparameter Tuning

Bala Priya C shares some tips and techniques:

If you’re familiar with machine learning, you know that the training process allows the model to learn the optimal values for the parameters—or model coefficients—that characterize it. But machine learning models also have a set of hyperparameters whose values you should specify when training the model. So how do you find the optimal values for these hyperparameters?

You can use hyperparameter tuning to find the best values for the hyperparameters. By systematically adjusting hyperparameters, you can optimize your models to achieve the best possible results.

This tutorial provides practical tips for effective hyperparameter tuning—starting from building a baseline model to using advanced techniques like Bayesian optimization. Whether you’re new to hyperparameter tuning or looking to refine your approach, these tips will help you build better machine learning models. Let’s get started.

Read on for those techniques. Incidentally, one of my “Old man yells at clouds” takes is that I dislike the existence of hyperparameters and consider them a modeling failure, essentially telling the implementer to do part of the researcher’s work. Knowing that they are necessary to work with for so many algorithms, there’s nothing to do but learn how to work with them effectively, but there’s a feel of outsourcing the hard work to users that I don’t like about the process. For that reason, I have extra respect for algorithms that neither need nor offer hyperparameters.

Comments closed

Systematic Sampling in R

Steven Sanderson continues a series on sampling:

In this post, we will explore systematic sampling in R using base R functions. Systematic sampling is a technique where you select every (k^{th}) element from a list or dataset. This method is straightforward and useful when you want a representative sample without the complexity of more advanced sampling techniques.

Let’s dive into an example to understand how it works.

In very technical circles, this is also known as the “eenie-meenie-meiney-moe technique” and is very similar to the “duck-duck-goose” algorithm, though that has an additional stochastic input.

Comments closed

Cluster Sampling in R

Steven Sanderson shows us one sampling technique:

Cluster sampling is a useful technique when dealing with large datasets spread across different groups or clusters. It involves dividing the population into clusters, randomly selecting some clusters, and then sampling all or some members from these selected clusters. This method can save time and resources compared to simple random sampling.

In this post, we’ll walk through how to perform cluster sampling in R. We’ll use a sample dataset and break down the code step-by-step. By the end, you’ll have a clear understanding of how to implement cluster sampling in your projects.

Read on for the scenario and sample code.

Comments closed

Stratified Sampling in R

Steven Sanderson builds a sample:

Stratified sampling is a technique used to ensure that different subgroups (strata) within a population are represented in a sample. This method is particularly useful when certain strata are underrepresented in a simple random sample. In this post, we’ll explore how to perform stratified sampling in R using both base R and the dplyr package. We’ll walk through examples and explain the code, so you can try these techniques on your own data.

Click through to see how.

Comments closed

AutoML in Python with TPOT

Abid Ali Awan gives us a primer on TPOT:

AutoML is a tool designed for both technical and non-technical experts. It simplifies the process of training machine learning models. All you have to do is provide it with the dataset, and in return, it will provide you with the best-performing model for your use case. You don’t have to code for long hours or experiment with various techniques; it will do everything on its own for you.

In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn to build a machine learning classifier, save the model, and use it for model inference.

Click through to see an example of how to use the library.

Comments closed

Changing Distributions and Simpson’s Paradox

Jerry Tuttle describes a paradox:

So you spent hours, or maybe days, cranking out thousands of numbers, you submit it to your boss just at the deadline, your boss quickly peruses your exhibit of numbers, points to a single number and says, “This number doesn’t look right.” Bosses have an uncanny ability to do this.

      Your boss is pointing to something like this: Your company sells property insurance on both personal and commercial properties. The average personal property premium increased 10% in 2024. The average commercial property premium increased 10% in 2024. But you say the combined average property premium decreased 3% in 2024. You realize that negative 3% does not look right.

Although the blog post doesn’t explicitly mention Simpson’s paradox, I’d argue that this is a good example of the idea. H/T R-Bloggers.

Comments closed

Using the fast_regression() Method in tidyAML

Steven Sanderson says, It’s my regression and I want it NOW:

If you’ve ever faced the daunting task of setting up multiple regression models in R, you’ll appreciate the convenience and efficiency that tidyAML brings to the table. Today, we’re diving into one of its standout functions: fast_regression(). This function is designed to streamline the regression modeling process, allowing you to quickly create and evaluate a variety of model specifications with minimal code.

Read on to see how the function works.

Comments closed