Press "Enter" to skip to content

Category: Data Science

Comparing the ROC Curve to a Precision-Recall Curve

Ivan Palomares Carrascosa looks at two ways to plot classification model trade-offs:

When building machine learning models to classify imbalanced data — i.e. datasets where the presence of one class (like spam email for example) is much less frequent than the presence of the other class (non-spam email, for instance) — certain traditional metrics like accuracy or even the ROC AUC (Receiving Operating Characteristic curve and the area under it) may not reflect the model performance in realistic terms, giving overly optimistic estimates due to the dominance of the so-called negative class.

Precision-recall curves (or PR curves for short), on the other hand, are designed to focus specifically on the positive, typically rarer class, which is a much more informative measure for skewed datasets due to class imbalance.

Read on to see how these two curves can diverge and when you might trust one over the other. Ivan’s post does rely on the idea of the positive class being the smaller one and the dataset being markedly unbalanced

Leave a Comment

Challenges of High-Dimensional Optimization

John Mount lays out a demonstration:

My experience is that common objective functions tend to be structured and full of coincidences and symmetries. And because they have these structures they are hard to optimize.

Let’s work up what I claim to be a fairly typical optimization problem that arises from planning or scheduling. I’ll call it the train arrival schedule problem.

Click through for the article, which includes demonstration code.

Leave a Comment

K-Means Clustering in SQL Server

Sebastiao Pereira implements k-means clustering in T-SQL:

K-means clustering is an unsupervised machine learning algorithm used to group data into k distinct clusters based on their similarity, allowing for customer segmentation, anomaly detection, trend analysis, etc. The most common machine learning tutorials focus on Python or R. Normally, data is stored in SQL Server, and it is necessary to move data out of the database to apply clustering algorithms and then, if necessary, to update the original data with the cluster numbers. Is it possible to do it directly in SQL Server?

Given the work you have to do to implement this, I can’t imagine that it would be particularly fast. But it is neat to see that it’s possible.

Leave a Comment

Making XGBoost Run Faster

Ivan Palomares Carrascosa shares a few tips:

Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by previous trees in the pipeline.

In a recent article, we explored the importance and ways to interpret predictions made by XGBoost models (note we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time by illustrating three strategies to speed up and improve its performance.

Read on for two tips to reduce operational load and one to offload it to faster hardware (when possible).

Leave a Comment

An Introduction to Bayesian Regression

Ivan Palomares Carrascosa covers the concept of Bayesian regression:

In this article, you will learn:

  • The fundamental difference between traditional regression, which uses single fixed values for its parameters, and Bayesian regression, which models them as probability distributions.
  • How this probabilistic approach allows the model to produce a full distribution of possible outcomes, thereby quantifying the uncertainty in its predictions.
  • How to implement a simple Bayesian regression model in Python with scikit-learn.

My understanding is that both Bayesian and traditional regression techniques get you to (roughly) the same place, but the Bayesian approach makes it harder to forget that the regression line you draw doesn’t actually exist and everything has uncertainty.

Leave a Comment

Choosing between Random Forest and Gradient Boosting

Jayita Gulati compares two popular algorithms for classification:

When working with machine learning on structured data, two algorithms often rise to the top of the shortlist: random forests and gradient boosting. Both are ensemble methods built on decision trees, but they take very different approaches to improving model accuracy. Random forests emphasize diversity by training many trees in parallel and averaging their results, while gradient boosting builds trees sequentially, each one correcting the mistakes of the last.

This article explains how each method works, their key differences, and how to decide which one best fits your project.

Click through for the explanation.

Leave a Comment

Time Series Forecasting in Python

Myles Mitchell builds an ARIMA model:

In time series analysis we are interested in sequential data made up of a series of observations taken at regular intervals. Examples include:

  • Weekly hospital occupancy
  • Monthly sales figures
  • Annual global temperature

In many cases we want to use the observations up to the present day to predict (or forecast) the next N time points. For example, a hospital could reduce running costs if an appropriate number of beds are provisioned.

Read on for a primer on the topic, a quick explanation of ARIMA, and a sample implementation using several Python packages.

Leave a Comment

Time Series Helpers in NumPy

Bala Priya C shares some one-liners:

NumPy’s array operations can help simplify most common time series operations. Instead of thinking step-by-step through data transformations, you can apply vectorized operations that process entire datasets at once.

This article covers 10 NumPy one-liners that can be used for time series analysis tasks you’ll come across often. Let’s get started!

Click through to see the ten in action.

Leave a Comment

Contrasting Three Classification Algorithms for Small Datasets

Jayita Gulati compares a few mechanisms to classify data:

When you have a small dataset, choosing the right machine learning model can make a big difference. Three popular options are logistic regression, support vector machines (SVMs), and random forests. Each one has its strengths and weaknesses. Logistic regression is easy to understand and quick to train, SVMs are great for finding clear decision boundaries, and random forests are good at handling complex patterns, but the best choice often depends on the size and nature of your data.

In this article, we’ll compare these three methods and see which one tends to work best for smaller datasets.

All three are quite reasonable algorithms to compare, though I’d want to add in gradient descent or XGBoost, as I’d expect it to perform better than random forest with small datasets.

Leave a Comment

Tips for Working with Pandas

Matthew Mayo has a few tips when working with Pandas for data preparation:

If you’re reading this, it’s likely that you are already aware that the performance of a machine learning model is not just a function of the chosen algorithm. It is also highly influenced by the quality and representation of the data that said model has been trained on.

Data preprocessing and feature engineering are some of the most important steps in your machine learning workflow. In the Python ecosystem, Pandas is the go-to library for these types of data manipulation tasks, something you also likely know. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better performing models.

This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.

Click through for those tips and tricks.

Comments closed