
Category: Data Science

Text Classification with Decision Trees

Ivan Palomares Carrascosa takes us through a simple natural language processing problem and solution:

It’s no secret that decision tree-based models excel at a wide range of classification and regression tasks, typically on structured, tabular data. However, when combined with the right techniques, decision trees also become powerful predictive tools for unstructured data, such as text or images, and even time series data.

This article demonstrates how to build decision trees for text data. Specifically, we will incorporate text representation techniques like TF-IDF and embeddings in decision trees trained for spam email classification, evaluating their performance and comparing the results with another text classification model — all with the aid of Python’s Scikit-learn library.

Read on for the demos and to see how three different approaches work.
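If you want a feel for the TF-IDF route before clicking through, here is a minimal sketch. It assumes a tiny, made-up list of messages rather than the spam dataset Ivan works with:

```python
# Minimal sketch: TF-IDF features feeding a decision tree classifier.
# The toy messages and labels below are invented for illustration; the
# article works against a real spam email dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

messages = [
    "Win a free prize now, click here",
    "Meeting moved to 3pm tomorrow",
    "Congratulations, you have been selected for a cash reward",
    "Can you review the attached report before Friday?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF turns each message into a sparse numeric vector the tree can split on.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    DecisionTreeClassifier(max_depth=5, random_state=42),
)
model.fit(messages, labels)

print(model.predict(["Free cash reward, click now"]))  # likely [1]
```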


Feature Importance in XGBoost

Ivan Palomares Carrascosa takes a look at one of my favorite plots in XGBoost:

One of the most widespread machine learning techniques is XGBoost (Extreme Gradient Boosting). An XGBoost model — or, to be more precise, an ensemble that combines multiple models into a single predictive system — builds several decision trees and combines them sequentially, so that the overall prediction is progressively improved by correcting the errors made by previous trees in the pipeline.

Just like standalone decision trees, XGBoost can accommodate both regression and classification tasks. While the combination of many trees into a single composite model may obscure its interpretability at first, there are still mechanisms to help you interpret an XGBoost model. In other words, you can understand why predictions are made and how input features contributed to them.

This article takes a practical dive into XGBoost model interpretability, with a particular focus on feature importance.

Read on to learn more about how feature importance works, as well as the three different views of the data you can get.
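For reference, those views come out of the booster's get_score() method (and plot_importance() for the chart itself). A quick sketch on synthetic data, not the article's example:

```python
# Quick sketch of the three importance views XGBoost exposes
# (weight, gain, cover), using synthetic regression data rather than
# the dataset from the article.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
model.fit(X, y)

booster = model.get_booster()
for importance_type in ("weight", "gain", "cover"):
    print(importance_type, booster.get_score(importance_type=importance_type))

# xgb.plot_importance(model, importance_type="gain")  # the plot version
```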


Portfolio Theory and Risk Reduction

John Mount continues a series on risk optimization:

I want to discuss how fragile optimization solutions to real-world problems can be. And how to solve that.

Small changes in modeling strategy, assumptions, data, estimates, constraints, or objective can lead to unstable and degenerate solutions. To warm up, let’s discuss one of the most famous optimization examples: Stigler’s minimal subsistence diet problem.

There are some neat stories in the post as you walk through problems of linear programming.
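If you want to poke at a diet-style linear program yourself while reading, here is a toy version with scipy. The foods, costs, and nutrient numbers are invented for illustration, not Stigler's actual data:

```python
# Toy diet-style linear program in the spirit of Stigler's problem:
# minimize cost subject to meeting nutrient floors. The foods, costs,
# and nutrient values here are invented for illustration only.
from scipy.optimize import linprog

# Cost per unit of each food: [bread, milk, beans]
cost = [0.30, 0.50, 0.40]

# Nutrients supplied per unit of each food
calories = [800, 600, 700]
protein = [20, 30, 40]

# linprog handles A_ub @ x <= b_ub, so flip signs to express ">=" floors.
A_ub = [[-c for c in calories], [-p for p in protein]]
b_ub = [-2000, -70]  # at least 2000 calories and 70g protein per day

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)  # quantities of each food and the minimal daily cost
```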

Also, Nina Zumel has a post on overestimation bias:

Revenue optimization projects can be particularly valuable and exciting. They involve:

  • Estimating demand as a function of offered features, price, and match to market.
  • Picking a set of offerings and prices optimizing the above inferred demand.

The great opportunity of these projects is that one can derive value from improving the inference of the demand estimate function, improving the optimization, and even improving the synergy between these two steps.

However, there is a common situation that can lose client trust and sink revenue optimization projects.

Read on for that article.
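The general effect Nina describes is easy to reproduce in a toy simulation: when you pick the option whose estimated value is highest, that estimate tends to overstate the option's true value. This little script is my illustration, not hers:

```python
# Toy illustration of overestimation bias in "estimate, then pick the best"
# workflows: the estimated value of the chosen option tends to exceed its
# true value. Illustrative only, not Nina Zumel's example.
import numpy as np

rng = np.random.default_rng(1)
n_options, n_trials, noise = 20, 10_000, 1.0
true_values = rng.normal(size=n_options)

gaps = []
for _ in range(n_trials):
    estimates = true_values + rng.normal(scale=noise, size=n_options)
    best = estimates.argmax()                    # optimize over noisy estimates
    gaps.append(estimates[best] - true_values[best])

print(f"Average overestimate of the chosen option: {np.mean(gaps):.2f}")
# Consistently positive: the winner's estimate flatters it.
```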


Modeling Uncertainty Early

John Mount isn’t quite sure:

Recently here at Win Vector LLC we have been delivering good client outcomes using the Stan MCMC sampler. It has allowed us to infer deep business factors, instead of being limited to surface KPIs (key performance indicators). Modeling uncertainty requires stronger optimizers to solve our problems, but it leads to better anti-fragile business solutions.

A fun part of this is that it really improves how visible uncertainty is. Let’s show this in a quick, simplified example.

Click through for an explanation of classic optimization versus a more sophisticated approach that deals with uncertainty early and factors that into the optimization problem.
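To make the contrast concrete in a much simpler setting than John's Stan-based example, here is a newsvendor-style toy: optimizing against a single point estimate of demand versus optimizing profit averaged over sampled demand scenarios.

```python
# Stripped-down contrast between optimizing against a point estimate and
# optimizing the expected outcome over sampled uncertainty (a newsvendor-style
# toy, not the Stan-based example from the post).
import numpy as np

rng = np.random.default_rng(2)
price, cost = 5.0, 3.0
demand_samples = rng.lognormal(mean=4.0, sigma=0.5, size=10_000)

def expected_profit(q, demand):
    # Profit if we stock q units: sell min(q, demand), pay for all q units.
    return np.mean(price * np.minimum(q, demand) - cost * q)

candidates = np.arange(1, 201)

# Classic approach: pretend demand is exactly its point estimate (the mean).
point_estimate = demand_samples.mean()
q_classic = candidates[np.argmax([price * min(q, point_estimate) - cost * q
                                  for q in candidates])]

# Uncertainty-aware approach: maximize profit averaged over the samples.
q_robust = candidates[np.argmax([expected_profit(q, demand_samples)
                                 for q in candidates])]

print(q_classic, q_robust)
print(expected_profit(q_classic, demand_samples),
      expected_profit(q_robust, demand_samples))
```

Running it, the point-estimate solution stocks more than the uncertainty-aware one and earns less on average, which is the kind of gap John's example makes visible much earlier in the modeling process.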


Reasons Regression Models Under-Perform

Ivan Palomares Carrascosa has a list:

In regression models, failure occurs when the model produces inaccurate predictions — that is, when error metrics like MAE or RMSE are high — or when the model, once deployed, fails to generalize well to new data that differs from the examples it was trained or tested on. While model failure typically shows up in one or both of these forms, the root causes can be more diverse and subtle.

This article explores some common reasons why regression models may underperform and outlines how to detect these issues. It is also accompanied by practical code excerpts using XGBoost — a robust and highly tunable ensemble-based regression model. Despite its popularity and power, XGBoost can also fail if not trained or evaluated properly!

These are high-level reasons but they’re good to keep in mind.
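One of those checks is cheap to run yourself: compare train and test error for an XGBoost regressor and see whether the gap points at overfitting. Synthetic data here, not the article's examples:

```python
# Quick check for one common failure mode (overfitting): compare train and
# test error for an XGBoost regressor. Synthetic data, not the article's.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately deep, unregularized model to make the gap obvious.
model = xgb.XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.3)
model.fit(X_train, y_train)

for name, Xs, ys in (("train", X_train, y_train), ("test", X_test, y_test)):
    pred = model.predict(Xs)
    mae = mean_absolute_error(ys, pred)
    rmse = np.sqrt(mean_squared_error(ys, pred))
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f}")
# A large train/test gap points at overfitting rather than irreducible noise.
```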


Getting beyond Pandas

Shittu Olumide recommends a few other packages:

If you’ve worked with data in Python, chances are you’ve used Pandas many times. And for good reason; it’s intuitive, flexible, and great for day-to-day analysis. But as your datasets start to grow, Pandas starts to show its limits. Maybe it’s memory issues, sluggish performance, or the fact that your machine sounds like it’s about to lift off when you try to group by a few million rows.

That’s the point where a lot of data analysts and scientists start asking the same question: what else is out there?

Read on for seven options, including six libraries and one built-in programming technique.
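I won't spoil the list, but Polars is one library that usually comes up in this conversation. Whether or not it is among Shittu's picks, the group-by pattern there looks like this:

```python
# One commonly cited Pandas alternative is Polars; whether or not it appears
# on the article's list, a group-by there looks like this.
import polars as pl

df = pl.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "sales": [10, 12, 7, 9, 11],
})

# group_by() in recent Polars releases (older versions spell it groupby()).
summary = (
    df.group_by("store")
      .agg(pl.col("sales").sum().alias("total_sales"),
           pl.col("sales").mean().alias("avg_sales"))
)
print(summary)
```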


Using R for Forecasting in Excel

Adam Gladstone continues a series on using R in Excel:

We have already seen how to obtain descriptive statistics in Part I and how to use lm() in Part II. In this part (Part III) of the series we will look at using R in Excel to perform forecasting and time series analysis.

In the previous two parts we have seen different ways to handle the output from R function calls, unpacking and massaging the data as required. In this part we are going to focus on setting up and interacting with a number of models in the ‘forecast’ package (fpp2).

Read on for the demo. This is getting into territory that is by no means trivial to do natively in Excel.


Making k-Means Clustering Better

Matthew Mayo shares a few tips:

The k-means algorithm is a cornerstone of unsupervised machine learning, known for its simplicity and trusted for its efficiency in partitioning data into a predetermined number of clusters. Its straightforward approach — assigning data points to the nearest centroid and then updating the centroid based on the mean of the assigned points — makes it one of the first algorithms most data scientists learn. It is a workhorse, capable of providing quick and valuable insights into the underlying structure of a dataset.

This simplicity comes with a set of limitations, however. Standard k-means often struggles when faced with the complexities of real-world data. Its performance can be sensitive to the initial placement of centroids, it requires the number of clusters to be specified in advance, and it fundamentally assumes that clusters are spherical and evenly sized. These assumptions rarely hold true in the wild, leading to suboptimal or even misleading results.

Read on for a few ways to relax some of the constraints in k-means clustering.
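Two of the usual mitigations, which may or may not line up with Matthew's specific tips, are smarter initialization with multiple restarts and letting a score such as silhouette choose k:

```python
# Two common ways to soften k-means' limitations: k-means++ initialization
# with multiple restarts, and choosing k via the silhouette score. These may
# or may not match the article's specific tips.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"Best k by silhouette: {best_k} (score={best_score:.3f})")
```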


Choosing a Good Split for a Decision Tree

Ivan Palomares Carrascosa continues a series on decision trees:

But what are the underlying mechanisms that make decision trees so well-suited for various predictive tasks? And what criteria are internally used to construct them? Specifically, how are nodes recursively split as the tree-shaped structure is formed? This article takes a closer look at the inner workings of decision trees, focusing on how branches are created through deliberate, data-driven splitting (spoiler: it certainly doesn’t happen at random).

One of the main principles of CART is around finding efficient splits for trees, and this digs into some of those details.
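For the gist of how a CART-style split gets scored, here is a small sketch of the Gini impurity calculation. It is my illustration, not code from the article:

```python
# Sketch of how a CART-style split is scored: compute the weighted Gini
# impurity of the children and compare it to the parent. Illustration only.
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over class proportions p_k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right):
    # Impurity reduction achieved by splitting parent into left/right.
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
left   = np.array([0, 0, 0, 0])          # candidate split A: pure children
right  = np.array([1, 1, 1, 1])
print(split_gain(parent, left, right))    # clean split: large gain

left_b  = np.array([0, 1, 0, 1])          # candidate split B: mixed children
right_b = np.array([0, 1, 0, 1])
print(split_gain(parent, left_b, right_b))  # little to no gain
```

The tree greedily keeps whichever candidate split yields the largest gain, which is exactly the "deliberate, data-driven splitting" Ivan walks through.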


Decision Trees and Non-Tabular Data

Ivan Palomares Carrascosa explains that you can use more than standard structured data against decision trees:

Versatile, interpretable, and effective for a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades, widely used for classification and regression tasks. They remain popular today, whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes the boundaries of their versatility even further: they can accommodate data in diverse formats, beyond just fully structured, tabular data. This article examines this facet of decision trees from a balanced theoretical and practical approach.

Click through for an example.
