Press "Enter" to skip to content

Category: Data Science

Making XGBoost Run Faster

Ivan Palomares Carrascosa shares a few tips:

Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by previous trees in the pipeline.

In a recent article, we explored why it matters to interpret predictions made by XGBoost models and how to do so (note that we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time illustrating three strategies to speed it up and improve its performance.

Read on for two tips to reduce operational load and one to offload it to faster hardware (when possible).
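To give a taste of the kind of levers involved, here is a minimal sketch of three common XGBoost speed-ups: the histogram tree method, early stopping, and (optionally) GPU training. These are my illustrative picks rather than necessarily the article's exact three strategies, and the GPU line assumes XGBoost 2.x with a CUDA device available.

```python
# Common XGBoost speed-ups (illustrative, not necessarily the article's three):
# histogram-based splits, early stopping, and optional GPU training.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=50_000, n_features=50, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(
    tree_method="hist",        # histogram-based split finding is much faster than "exact"
    n_estimators=2_000,        # set high, then let early stopping pick the real count
    early_stopping_rounds=50,  # stop once the validation score stalls
    # device="cuda",           # uncomment on XGBoost >= 2.0 with a CUDA GPU available
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("trees actually used:", model.best_iteration + 1)
```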


An Introduction to Bayesian Regression

Ivan Palomares Carrascosa covers the concept of Bayesian regression:

In this article, you will learn:

  • The fundamental difference between traditional regression, which uses single fixed values for its parameters, and Bayesian regression, which models them as probability distributions.
  • How this probabilistic approach allows the model to produce a full distribution of possible outcomes, thereby quantifying the uncertainty in its predictions.
  • How to implement a simple Bayesian regression model in Python with scikit-learn.

My understanding is that both Bayesian and traditional regression techniques get you to (roughly) the same place, but the Bayesian approach makes it harder to forget that the regression line you draw doesn’t actually exist and that everything carries uncertainty.
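To make that uncertainty point concrete, here is a minimal sketch of the scikit-learn route the excerpt mentions, using BayesianRidge on synthetic data of my own invention; the key detail is that predict(..., return_std=True) hands back a standard deviation alongside each prediction.

```python
# Minimal Bayesian regression sketch with scikit-learn's BayesianRidge.
# Synthetic data; the point is that predictions come back as a mean plus a
# per-observation standard deviation, i.e. with uncertainty attached.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=2.0, size=200)

model = BayesianRidge()
model.fit(X, y)

X_new = np.array([[2.5], [7.5]])
mean, std = model.predict(X_new, return_std=True)
for x, m, s in zip(X_new[:, 0], mean, std):
    print(f"x={x:.1f}: predicted {m:.2f} +/- {s:.2f}")
```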


Choosing between Random Forest and Gradient Boosting

Jayita Gulati compares two popular algorithms for classification:

When working with machine learning on structured data, two algorithms often rise to the top of the shortlist: random forests and gradient boosting. Both are ensemble methods built on decision trees, but they take very different approaches to improving model accuracy. Random forests emphasize diversity by training many trees in parallel and averaging their results, while gradient boosting builds trees sequentially, each one correcting the mistakes of the last.

This article explains how each method works, their key differences, and how to decide which one best fits your project.

Click through for the explanation.
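If you want to kick the tires on both approaches yourself, a quick cross-validated comparison with the scikit-learn implementations looks roughly like the sketch below; the dataset choice and defaults are mine, not the article's setup.

```python
# Quick side-by-side of bagged vs. boosted trees using scikit-learn's implementations.
# Random forest trains trees independently and averages them; gradient boosting
# adds trees sequentially, each fitting the previous ensemble's errors.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```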


Time Series Forecasting in Python

Myles Mitchell builds an ARIMA model:

In time series analysis we are interested in sequential data made up of a series of observations taken at regular intervals. Examples include:

  • Weekly hospital occupancy
  • Monthly sales figures
  • Annual global temperature

In many cases we want to use the observations up to the present day to predict (or forecast) the next N time points. For example, a hospital could reduce running costs by provisioning an appropriate number of beds based on forecast occupancy.

Read on for a primer on the topic, a quick explanation of ARIMA, and a sample implementation using several Python packages.
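For a feel of what the modeling step looks like, here is a minimal ARIMA forecast using statsmodels on synthetic monthly data; the (1, 1, 1) order is an arbitrary choice for illustration, and in practice you would select the order as part of the modeling workflow.

```python
# Minimal ARIMA forecast with statsmodels on synthetic monthly data.
# The (p, d, q) order here is arbitrary; choose it properly via model selection.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=60, freq="MS")
y = pd.Series(100 + np.cumsum(rng.normal(size=60)), index=index)

model = ARIMA(y, order=(1, 1, 1))
result = model.fit()
forecast = result.forecast(steps=6)   # predict the next 6 months
print(forecast)
```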


Time Series Helpers in NumPy

Bala Priya C shares some one-liners:

NumPy’s array operations can help simplify most common time series operations. Instead of thinking step-by-step through data transformations, you can apply vectorized operations that process entire datasets at once.

This article covers 10 NumPy one-liners that can be used for time series analysis tasks you’ll come across often. Let’s get started!

Click through to see the ten in action.
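For flavor, here are a few typical examples of the genre (my own picks, not necessarily the ten from the article): differencing, a rolling mean via sliding_window_view, a running total, and percentage change.

```python
# A few representative NumPy one-liners for time series work
# (illustrative picks, not necessarily the ten from the article).
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

y = np.array([12.0, 15.0, 14.0, 18.0, 21.0, 19.0, 24.0, 26.0])

diffs = np.diff(y)                                     # period-over-period change
rolling_mean = sliding_window_view(y, 3).mean(axis=1)  # 3-point rolling average
cumulative = np.cumsum(y)                              # running total
pct_change = np.diff(y) / y[:-1]                       # percentage change

print(rolling_mean)
```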


Contrasting Three Classification Algorithms for Small Datasets

Jayita Gulati compares a few mechanisms to classify data:

When you have a small dataset, choosing the right machine learning model can make a big difference. Three popular options are logistic regression, support vector machines (SVMs), and random forests. Each one has its strengths and weaknesses. Logistic regression is easy to understand and quick to train, SVMs are great for finding clear decision boundaries, and random forests are good at handling complex patterns, but the best choice often depends on the size and nature of your data.

In this article, we’ll compare these three methods and see which one tends to work best for smaller datasets.

All three are quite reasonable algorithms to compare, though I’d want to add in gradient boosting (e.g., XGBoost), as I’d expect it to perform better than random forests on small datasets.
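A simple way to run this kind of comparison on your own small dataset is a cross-validated bake-off in scikit-learn, roughly as sketched below; the dataset and hyperparameters are placeholders, not the article's setup.

```python
# Sketch of comparing the three classifiers on a small dataset with cross-validation.
# Scaling matters for logistic regression and SVMs, so those two get a pipeline.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)   # 178 rows: small by most standards

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5_000)),
    "svm (rbf)": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```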


Tips for Working with Pandas

Matthew Mayo has a few tips when working with Pandas for data preparation:

If you’re reading this, it’s likely that you are already aware that the performance of a machine learning model is not just a function of the chosen algorithm. It is also highly influenced by the quality and representation of the data that said model has been trained on.

Data preprocessing and feature engineering are some of the most important steps in your machine learning workflow. In the Python ecosystem, Pandas is the go-to library for these types of data manipulation tasks, something you also likely know. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better performing models.

This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.

Click through for those tips and tricks.
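To give a flavor of the territory, here are a few everyday transformations of the sort such a tutorial covers: one-hot encoding a categorical column, binning a numeric one, and building a group-level feature. These specific examples are mine, not necessarily Mayo's seven scenarios.

```python
# A taste of common Pandas preparation steps: one-hot encoding, binning a numeric
# column, and a group-level aggregate feature (illustrative examples only).
import pandas as pd

df = pd.DataFrame({
    "city": ["Madrid", "Oslo", "Madrid", "Lima"],
    "age": [23, 41, 35, 58],
    "income": [31_000, 52_000, 44_000, 39_000],
})

df = pd.get_dummies(df, columns=["city"], prefix="city")       # one-hot encode a categorical
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                        labels=["young", "middle", "senior"])  # bin a numeric feature
df["income_vs_band_mean"] = (
    df["income"] - df.groupby("age_band", observed=True)["income"].transform("mean")
)                                                              # deviation from group mean
print(df)
```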


Handling Missing Data in R

M. Fatih Tüzen fills in the gaps:

Data preprocessing is a cornerstone of any data analysis or machine learning pipeline. Raw data rarely comes in a form ready for direct analysis — it often requires cleaning, transformation, normalization, and careful handling of anomalies. Among these preprocessing tasks, dealing with missing data stands out as one of the most critical and unavoidable challenges.

Missing values appear in virtually every domain: surveys may have skipped questions, administrative registers might contain incomplete records, and clinical trials can suffer from dropout patients. Ignoring these gaps or handling them naively does not just reduce the amount of usable information; it can also introduce bias, decrease statistical power, and ultimately compromise the validity of conclusions. In other words, missing data is not just an inconvenience — it is a methodological problem that demands rigorous attention.

Quite often, we gloss over what to do with missing data when explaining or working through the data science process, in part because it’s a hard problem. This post digs into the specifics of the matter, taking us through eight separate methods. H/T R-Bloggers.
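The post itself works in R; purely to fix ideas, here is what the two simplest strategies, listwise deletion and mean imputation, look like in pandas. The post covers eight methods, so treat this only as a conceptual warm-up.

```python
# Two of the simplest missing-data strategies, shown in pandas as a conceptual
# illustration. The linked post works in R and goes into eight methods.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 47, 33, np.nan],
    "income": [40_000, 52_000, np.nan, 45_000, 61_000],
})

dropped = df.dropna()           # listwise deletion: discard incomplete rows
imputed = df.fillna(df.mean())  # mean imputation: fill gaps with column means

print(f"rows kept after deletion: {len(dropped)} of {len(df)}")
print(imputed)
```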


Diagnosing Classification Model Failures

Ivan Palomares Carrascosa looks into a poorly-fitting model:

In classification models, failure occurs when the model assigns the wrong class to a new data observation; that is, when its classification accuracy is not high enough over a certain number of predictions. It also manifests when a trained classifier fails to generalize well to new data that differs from the examples it was trained on. While model failure typically presents itself in several forms, including the aforementioned ones, the root causes can sometimes be more diverse and subtle.

This article explores some common reasons why classification models may underperform and outlines how to detect, diagnose, and mitigate these issues.

The explanations are fairly high-level and focus mostly on two-class rather than multi-class classification, but there is some good guidance in here.
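Whatever the root cause turns out to be, a good first diagnostic is to look past overall accuracy at the confusion matrix and per-class metrics. The sketch below shows that generic step on synthetic, imbalanced data; it is not taken from the article.

```python
# A common first step when a classifier underperforms: look past overall accuracy
# at the confusion matrix and per-class metrics (generic diagnostic, not the
# article's exact workflow).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)  # imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # where the errors actually land
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1
```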


A Primer on Bayesian Modeling

Hristo Hristov is speaking my language:

Multivariate analysis in data science is a type of analysis that tackles multiple input/predictor and output/predicted variables. This tip explores the problem of predicting air pollution, measured as particulate matter (PM) concentration, based on ambient temperature, humidity, and pressure using a Bayesian model.

Click through for a detailed code sample and explanation.
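If you want the general shape of such a model before diving into the tip, the sketch below builds a multivariate Bayesian linear regression in PyMC on synthetic data. PyMC is just one common choice rather than necessarily the tip's library, and the column names (temperature, humidity, pressure, pm) are hypothetical stand-ins.

```python
# Rough sketch of a multivariate Bayesian linear regression in PyMC on synthetic
# data. The library, priors, and column names are my assumptions, not the tip's.
import arviz as az
import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "temperature": rng.normal(20, 5, 300),
    "humidity": rng.normal(60, 10, 300),
    "pressure": rng.normal(1013, 5, 300),
})
df["pm"] = 5 + 0.4 * df["temperature"] - 0.1 * df["humidity"] + rng.normal(0, 2, 300)

# Standardize predictors so the sampler has an easier time
X = df[["temperature", "humidity", "pressure"]].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)

with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    betas = pm.Normal("betas", mu=0, sigma=10, shape=3)   # one coefficient per predictor
    sigma = pm.HalfNormal("sigma", sigma=5)               # observation noise
    mu = intercept + pm.math.dot(X, betas)
    pm.Normal("pm_obs", mu=mu, sigma=sigma, observed=df["pm"].to_numpy())
    idata = pm.sample(1_000, tune=1_000, chains=2, random_seed=1)

print(az.summary(idata, var_names=["intercept", "betas", "sigma"]))
```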
