Category: Data Science

Contrasting Three Classification Algorithms for Small Datasets

Jayita Gulati compares a few mechanisms to classify data:

When you have a small dataset, choosing the right machine learning model can make a big difference. Three popular options are logistic regression, support vector machines (SVMs), and random forests. Each one has its strengths and weaknesses. Logistic regression is easy to understand and quick to train, SVMs are great for finding clear decision boundaries, and random forests are good at handling complex patterns, but the best choice often depends on the size and nature of your data.

In this article, we’ll compare these three methods and see which one tends to work best for smaller datasets.

All three are quite reasonable algorithms to compare, though I’d want to add a gradient boosting model such as XGBoost, as I’d expect it to perform better than random forest with small datasets.
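A minimal sketch of what such a comparison might look like, assuming scikit-learn and using the built-in iris data (150 rows) as a stand-in for a small dataset. The dataset, hyperparameters, and fold count are assumptions for illustration and are not taken from the linked article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: iris stands in for "a small dataset".
X, y = load_iris(return_X_y=True)

# Scale inputs for the two models that are sensitive to feature scale.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Stratified folds keep class proportions stable, which matters more
# when there are only a few rows per class.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} "
          f"(std {fold_scores.std():.3f})")
```

With so few rows, the spread across folds is worth reading alongside the mean, since a single fold can swing the ranking.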

Tips for Working with Pandas

Matthew Mayo has a few tips for working with Pandas during data preparation:

If you’re reading this, it’s likely that you are already aware that the performance of a machine learning model is not just a function of the chosen algorithm. It is also highly influenced by the quality and representation of the data that said model has been trained on.

Data preprocessing and feature engineering are some of the most important steps in your machine learning workflow. In the Python ecosystem, Pandas is the go-to library for these types of data manipulation tasks, something you also likely know. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better performing models.

This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.

Click through for those tips and tricks.

Handling Missing Data in R

M. Fatih Tüzen fills in the gaps:

Data preprocessing is a cornerstone of any data analysis or machine learning pipeline. Raw data rarely comes in a form ready for direct analysis — it often requires cleaning, transformation, normalization, and careful handling of anomalies. Among these preprocessing tasks, dealing with missing data stands out as one of the most critical and unavoidable challenges.

Missing values appear in virtually every domain: surveys may have skipped questions, administrative registers might contain incomplete records, and clinical trials can suffer from dropout patients. Ignoring these gaps or handling them naively does not just reduce the amount of usable information; it can also introduce bias, decrease statistical power, and ultimately compromise the validity of conclusions. In other words, missing data is not just an inconvenience — it is a methodological problem that demands rigorous attention.

Quite often, we gloss over what to do with missing data when explaining or working through the data science process, in part because it’s a hard problem. This post digs into the specifics of the matter, taking us through eight separate methods. H/T R-Bloggers.
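To make the cost of naive handling concrete, here is a small sketch in Python with pandas (the linked post works in R; Python is used here only for illustration). The data frame is made up, and the two strategies shown, listwise deletion and median imputation, are simple baselines rather than a summary of the eight methods in the post.

```python
import numpy as np
import pandas as pd

# Hypothetical survey-style data; every value here is made up.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan, 46],
    "income": [52000, 61000, np.nan, 48000, 75000, np.nan],
    "satisfied": [1, 0, 1, np.nan, 1, 0],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Listwise deletion: keep only rows with no missing values at all.
complete_cases = df.dropna()

# Median imputation: keeps every row, but it shrinks the variance of
# the imputed columns and can bias downstream estimates.
imputed = df.fillna(df.median(numeric_only=True))

print(missing_share)
print(f"rows kept by listwise deletion: {len(complete_cases)} of {len(df)}")
print(f"rows kept by median imputation: {len(imputed)} of {len(df)}")
```

Each column is only about a third missing or less, yet the gaps fall on different rows, so listwise deletion discards most of the data.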

Diagnosing Classification Model Failures

Ivan Palomares Carrascosa looks into a poorly-fitting model:

In classification models, failure occurs when the model assigns the wrong class to a new data observation; that is, when its classification accuracy is not high enough over a certain number of predictions. It also manifests when a trained classifier fails to generalize well to new data that differs from the examples it was trained on. While model failure typically presents itself in several forms, including the aforementioned ones, the root causes can sometimes be more diverse and subtle.

This article explores some common reasons why classification models may underperform and outlines how to detect, diagnose, and mitigate these issues.

The explanations are fairly high-level and focus mostly on two-class rather than multi-class classification, but there is some good guidance in here.
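As a rough illustration of the two failure modes in the excerpt (wrong classes on new data, and poor generalization), the sketch below compares training and test accuracy and prints a confusion matrix. It assumes scikit-learn, its built-in two-class breast cancer data, and a deliberately unconstrained decision tree; none of these choices come from the linked article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# No depth limit, so the tree is free to memorize the training set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_pred = clf.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)

# A large train/test gap points to overfitting rather than a weak model.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, test_pred)
print(cm)

# Per-class precision and recall show which class the errors fall on.
print(classification_report(y_test, test_pred))
```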

Text Classification with Decision Trees

Ivan Palomares Carrascosa takes us through a simple natural language processing problem and solution:

It’s no secret that decision tree-based models excel at a wide range of classification and regression tasks, often based on structured, tabular data. However, when combined with the right tools, decision trees also become powerful predictive tools for unstructured data, such as text or images, and even time series data.

This article demonstrates how to build decision trees for text data. Specifically, we will incorporate text representation techniques like TF-IDF and embeddings in decision trees trained for spam email classification, evaluating their performance and comparing the results with another text classification model — all with the aid of Python’s Scikit-learn library.

Read on for the demos and to see how three different approaches work.
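For a sense of the TF-IDF variant, here is a minimal sketch that chains scikit-learn's `TfidfVectorizer` and `DecisionTreeClassifier` in a pipeline. The eight messages and their labels are made up for illustration; the linked article uses its own spam dataset and also covers embeddings, which are not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus: 1 = spam, 0 = not spam.
texts = [
    "Win a free prize now, click here",
    "Cheap loans approved instantly, apply today",
    "Congratulations, you won a free cruise",
    "Limited offer: claim your cash reward now",
    "Can we move the meeting to Thursday?",
    "Here are the notes from today's lecture",
    "Your invoice for last month is attached",
    "Lunch tomorrow at noon?",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each message into a sparse numeric vector; the tree then
# splits on individual term weights.
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)

new_messages = ["free prize waiting, click now", "agenda for tomorrow's meeting"]
preds = model.predict(new_messages)
for message, pred in zip(new_messages, preds):
    print(f"{message!r} -> {'spam' if pred == 1 else 'not spam'}")
```

A corpus this small only shows the mechanics; any accuracy figure from it would be meaningless.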

Feature Importance in XGBoost

Ivan Palomares Carrascosa takes a look at one of my favorite plots in XGBoost:

One of the most widespread machine learning techniques is XGBoost (Extreme Gradient Boosting). An XGBoost model — or an ensemble that combines multiple models into a single predictive task, to be more precise — builds several decision trees and sequentially combines them, so that the overall prediction is progressively improved by correcting the errors made by previous trees in the pipeline.

Just like standalone decision trees, XGBoost can accommodate both regression and classification tasks. While the combination of many trees into a single composite model may obscure its interpretability at first, there are still mechanisms to help you interpret an XGBoost model. In other words, you can understand why predictions are made and how input features contributed to them.

This article takes a practical dive into XGBoost model interpretability, with a particular focus on feature importance.

Read on to learn more about how feature importance works, as well as the three different views of the data you can get.
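The sequential error correction described in the excerpt can be sketched by hand. The loop below is a simplified gradient boosting routine for squared error built from scikit-learn regression trees: each tree is fit to the residuals left by the trees before it. It is not XGBoost itself, which adds regularization and other refinements, and the synthetic data, learning rate, and tree depth are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # Each new tree is trained on what the ensemble still gets wrong.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a damped version of the correction to the running prediction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {mse_history[0]:.4f}")
print(f"training MSE after {len(trees)} trees: {mse_history[-1]:.4f}")
```

The training error can only go down in this setup, which is exactly why boosted models need a held-out set to decide when to stop adding trees.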

Portfolio Theory and Risk Reduction

John Mount continues a series on risk optimization:

I want to discuss how fragile optimization solutions to real world problems can be. And how to solve that.

Small changes in modeling strategy, assumptions, data, estimates, constraints, or objective can lead to unstable and degenerate solutions. To warm up let’s discuss one of the most famous optimization examples: Stigler’s minimal subsistence diet problem.

There are some neat stories in the post as you walk through problems of linear programming.
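To see how a small change in inputs can flip a linear programming solution, here is a toy diet problem solved with `scipy.optimize.linprog`. The two foods, two nutrients, and all prices are hypothetical and far simpler than Stigler's actual problem; the sketch only illustrates the kind of fragility the post describes.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: rows are nutrients, columns are foods
# (units of nutrient supplied per unit of food).
nutrients = np.array([[1.0, 2.0],
                      [3.0, 1.0]])
minimums = np.array([4.0, 6.0])  # required amount of each nutrient


def cheapest_diet(prices):
    """Return (quantities, cost) of the cheapest diet meeting the minimums."""
    # linprog minimizes prices @ x subject to A_ub @ x <= b_ub, so the
    # "at least" nutrient constraints are negated.
    result = linprog(
        c=prices,
        A_ub=-nutrients,
        b_ub=-minimums,
        bounds=[(0, None), (0, None)],
    )
    return result.x, result.fun


base_x, base_cost = cheapest_diet([1.6, 3.0])
nudged_x, nudged_cost = cheapest_diet([1.4, 3.0])

print(f"price of food A = 1.6: buy {np.round(base_x, 3)}, cost {base_cost:.2f}")
print(f"price of food A = 1.4: buy {np.round(nudged_x, 3)}, cost {nudged_cost:.2f}")
```

Dropping the price of the first food from 1.6 to 1.4 moves the optimum from a mix of both foods to buying only the first one, because the solution jumps from one corner of the feasible region to another.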

Also, Nina Zumel has a post on overestimation bias:

Revenue optimization projects can be particularly valuable and exciting. They involve:

  • Estimating demand as a function of offered features, price, and match to market.
  • Picking a set of offerings and prices optimizing the above inferred demand.

The great opportunity of these projects is that one can derive value from improving the inference of the demand estimate function, improving the optimization, and even improving the synergy between these two steps.

However, there is a common situation that can lose client trust and sink revenue optimization projects.

Read on for that article.
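One well-known way overestimation can creep in is selection on noisy estimates: if you pick whichever offering has the highest estimated revenue, the winning estimate tends to be inflated by noise. The simulation below is a sketch of that general effect under made-up numbers, and it is an assumption that this matches the specific situation the post discusses.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials = 10_000     # repeated hypothetical projects
n_offerings = 10      # candidate offerings per project
true_revenue = 100.0  # assumption: every offering is truly worth the same
noise_sd = 10.0       # assumption: unbiased estimation noise

# Each estimate is unbiased on its own.
estimates = true_revenue + rng.normal(
    scale=noise_sd, size=(n_trials, n_offerings)
)

# The optimizer picks the offering with the best-looking estimate and
# reports that estimate as the expected revenue.
chosen_estimate = estimates.max(axis=1)

mean_promised = chosen_estimate.mean()
overestimate = mean_promised - true_revenue

print(f"true revenue of any offering: {true_revenue:.1f}")
print(f"average promised revenue:     {mean_promised:.1f}")
print(f"average overestimate:         {overestimate:.1f}")
```

Every individual estimate is unbiased, yet the promised figure is consistently too high, which is the sort of gap between forecast and delivery that erodes client trust.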

Modeling Uncertainty Early

John Mount isn’t quite sure:

Recently here at Win Vector LLC we have been delivering good client outcomes using the Stan MCMC sampler. It has allowed us to infer deep business factors, instead of being limited to surface KPIs (key performance indicators). Modeling uncertainty requires stronger optimizers to solve our problems, but it leads to better anti-fragile business solutions.

A fun part of this is it really improves how visible uncertainty is. Let’s show this in a quick simplified example.

Click through for an explanation of classic optimization versus a more sophisticated approach that deals with uncertainty early and factors that into the optimization problem.

Reasons Regression Models Under-Perform

Ivan Palomares Carrascosa has a list:

In regression models, failure occurs when the model produces inaccurate predictions — that is, when error metrics like MAE or RMSE are high — or when the model, once deployed, fails to generalize well to new data that differs from the examples it was trained or tested on. While model failure typically shows up in one or both of these forms, the root causes can be more diverse and subtle.

This article explores some common reasons why regression models may underperform and outlines how to detect these issues. It is also accompanied by practical code excerpts using XGBoost — a robust and highly tunable ensemble-based regression model. Despite its popularity and power, XGBoost can also fail if not trained or evaluated properly!

These are high-level reasons but they’re good to keep in mind.
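A quick way to check for the generalization failure described above is to compare MAE and RMSE on training and held-out data. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, with the built-in diabetes data and deliberately aggressive settings; all of these choices are assumptions for illustration and not the article's code.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Many deep trees on a few hundred rows: a setup that invites overfitting.
model = GradientBoostingRegressor(
    n_estimators=500, max_depth=4, random_state=0
).fit(X_train, y_train)


def error_metrics(X_part, y_part):
    """Return (MAE, RMSE) of the fitted model on the given data."""
    pred = model.predict(X_part)
    mae = mean_absolute_error(y_part, pred)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    return mae, rmse


train_mae, train_rmse = error_metrics(X_train, y_train)
test_mae, test_rmse = error_metrics(X_test, y_test)

print(f"train: MAE {train_mae:.1f}, RMSE {train_rmse:.1f}")
print(f"test:  MAE {test_mae:.1f}, RMSE {test_rmse:.1f}")
```

Low training error paired with much higher test error suggests the model has memorized the training set, and fewer or shallower trees would be a reasonable first thing to try.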
