Press "Enter" to skip to content

Category: Python

Comparing the ROC Curve to a Precision-Recall Curve

Ivan Palomares Carrascosa looks at two ways to plot classification model trade-offs:

When building machine learning models to classify imbalanced data — i.e., datasets where one class (spam email, for example) is much less frequent than the other (non-spam email) — traditional metrics like accuracy or even the ROC AUC (Receiver Operating Characteristic curve and the area under it) may not reflect model performance in realistic terms, giving overly optimistic estimates due to the dominance of the so-called negative class.

Precision-recall curves (or PR curves for short), on the other hand, are designed to focus specifically on the positive, typically rarer class, making them a much more informative measure for skewed datasets.

Read on to see how these two curves can diverge and when you might trust one over the other. Ivan’s post does rely on the idea of the positive class being the smaller one and the dataset being markedly unbalanced.
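To see the gap for yourself, here is a minimal scikit-learn sketch (mine, not Ivan's): on a heavily imbalanced dataset, the ROC AUC often looks comfortable while the average precision (the area under the PR curve) tells a bleaker story.

```python
# A minimal sketch (not the article's code) contrasting ROC AUC and
# PR AUC on an imbalanced binary classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# ~2% positive class to mimic a skewed problem like spam detection
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# ROC AUC can look flattering; average precision is usually far less
# forgiving when positives are rare.
print("ROC AUC:          ", roc_auc_score(y_test, scores))
print("Average precision:", average_precision_score(y_test, scores))
```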


Challenges of High-Dimensional Optimization

John Mount lays out a demonstration:

My experience is that common objective functions tend to be structured and full of coincidences and symmetries. And because they have these structures, they are hard to optimize.

Let’s work up what I claim to be a fairly typical optimization problem that arises from planning or scheduling. I’ll call it the train arrival schedule problem.

Click through for the article, which includes demonstration code.
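As a rough illustration of the phenomenon (my own toy objective, not John's train-schedule code), here is a hypothetical scheduling-flavored function with a permutation symmetry: swapping any two "trains" leaves the value unchanged, so the surface is full of symmetric basins for a local optimizer to fall into.

```python
# A hedged sketch of a structured, symmetric objective that frustrates
# local optimization. The problem below is invented for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def objective(t):
    # Penalize arrival times that bunch together. Swapping any two
    # entries of t leaves the value unchanged (permutation symmetry).
    gaps = np.subtract.outer(t, t)
    crowding = np.exp(-gaps**2).sum()
    return crowding + 0.01 * (t**2).sum()  # mild pull toward the origin

n = 20  # 20 "trains" to place on a timeline
results = [minimize(objective, rng.uniform(0, 60, n), method="L-BFGS-B")
           for _ in range(5)]

# Symmetric copies of a minimum share the same objective value, so
# identical printed values can still hide very different schedules.
print(sorted(round(r.fun, 2) for r in results))
```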


Modifying Power BI Page Visibility and Active Status via Semantic Link Labs

Meagan Longoria hides (or shows) a page:

Setting page visibility and the active page are often overlooked last steps when publishing a Power BI report. It’s easy to forget the active page since it’s just set to whatever page was open when you last saved the report. But we don’t have to settle for manually checking these things before we deploy to a new workspace (e.g., from dev to prod). If our report is in PBIR format, we can run Fabric notebooks to do this for us.

Click through for a notebook and an explanation.
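Meagan's notebook uses Semantic Link Labs; purely as a sketch of the underlying idea, a PBIR-format report stores its metadata as JSON files that a Fabric notebook can patch. The path and key names below ("activePageName", "visibility") are assumptions for illustration only; check the PBIR schema and the semantic-link-labs documentation for the real API.

```python
# A rough sketch of the idea only (not Meagan's notebook): PBIR keeps
# report metadata as JSON, so a notebook can edit it programmatically.
# All paths and key names below are assumed, not verified.
import json
from pathlib import Path

report_root = Path("MyReport.Report/definition/pages")  # hypothetical path

# Set the active page (assumed key name)
pages = json.loads((report_root / "pages.json").read_text())
pages["activePageName"] = "ReportPageLanding"
(report_root / "pages.json").write_text(json.dumps(pages, indent=2))

# Hide a helper page (assumed per-page file and enum value)
page_file = report_root / "ReportPageHidden" / "page.json"
page = json.loads(page_file.read_text())
page["visibility"] = "HiddenInViewMode"
page_file.write_text(json.dumps(page, indent=2))
```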


An Introduction to Batch Normalization in Neural Networks

Ivan Palomares Carrascosa shows off one technique for optimizing neural networks:

Deep neural networks have drastically evolved over the years, overcoming common challenges that arise when training these complex models. This evolution has enabled them to solve increasingly difficult problems effectively.

One of the mechanisms that has proven especially influential in the advancement of neural network-based models is batch normalization. This article provides a gentle introduction to this strategy, which has become a standard in many modern architectures, helping to improve model performance by stabilizing training, speeding up convergence, and more.

Read on for a quick description of how it works and a demonstration in Keras.
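For a sense of what this looks like in practice, here is a minimal Keras sketch (not the article's exact model) that places BatchNormalization layers between the dense layers and their activations:

```python
# A minimal batch normalization sketch in Keras, with dummy data
# included only to show the training call.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64),
    layers.BatchNormalization(),  # normalize pre-activations per mini-batch
    layers.Activation("relu"),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```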


Making XGBoost Run Faster

Ivan Palomares Carrascosa shares a few tips:

Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by previous trees in the pipeline.

In a recent article, we explored the importance of interpreting predictions made by XGBoost models and ways to do so (note we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time illustrating three strategies to speed up training and improve performance.

Read on for two tips to reduce operational load and one to offload it to faster hardware (when possible).
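As a sketch of the general shape of such speedups (whether these match the article's exact three tips is my guess), here is hypothetical XGBoost code combining the histogram tree method, early stopping on a validation set, and an optional GPU handoff:

```python
# A hedged sketch of common XGBoost speedups; the article has the
# authoritative list. Requires a reasonably recent xgboost version.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = XGBClassifier(
    tree_method="hist",        # histogram-based splits: big CPU speedup
    n_estimators=500,
    early_stopping_rounds=20,  # stop adding trees once validation stalls
    eval_metric="logloss",
    # device="cuda",           # offload to a GPU if one is available
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("trees actually used:", clf.best_iteration + 1)
```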


An Introduction to Bayesian Regression

Ivan Palomares Carrascosa covers the concept of Bayesian regression:

In this article, you will learn:

  • The fundamental difference between traditional regression, which uses single fixed values for its parameters, and Bayesian regression, which models them as probability distributions.
  • How this probabilistic approach allows the model to produce a full distribution of possible outcomes, thereby quantifying the uncertainty in its predictions.
  • How to implement a simple Bayesian regression model in Python with scikit-learn.

My understanding is that both Bayesian and traditional regression techniques get you to (roughly) the same place, but the Bayesian approach makes it harder to forget that the regression line you draw doesn’t actually exist and everything has uncertainty.
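For the scikit-learn portion, a minimal sketch (not necessarily the article's code) looks like this; BayesianRidge hands back a standard deviation alongside each predicted mean, which is exactly the uncertainty mentioned above.

```python
# A minimal Bayesian regression sketch with scikit-learn's BayesianRidge.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.5 * X.ravel() + rng.normal(0, 1.0, size=200)  # noisy linear data

model = BayesianRidge().fit(X, y)

# return_std=True yields the predictive standard deviation, i.e. the
# model's own uncertainty about each prediction.
mean, std = model.predict([[5.0]], return_std=True)
print(f"prediction at x=5: {mean[0]:.2f} +/- {std[0]:.2f}")
```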


Time Series Forecasting in Python

Myles Mitchell builds an ARIMA model:

In time series analysis we are interested in sequential data made up of a series of observations taken at regular intervals. Examples include:

  • Weekly hospital occupancy
  • Monthly sales figures
  • Annual global temperature

In many cases we want to use the observations up to the present day to predict (or forecast) the next N time points. For example, a hospital could reduce running costs by provisioning an appropriate number of beds.

Read on for a primer on the topic, a quick explanation of ARIMA, and a sample implementation using several Python packages.
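As a taste of what such an implementation might look like (statsmodels is one common choice; Myles's post may use others), here is a minimal ARIMA sketch on a synthetic monthly series:

```python
# A minimal ARIMA forecast sketch with statsmodels on synthetic data.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series: upward trend plus noise
rng = np.random.default_rng(7)
idx = pd.date_range("2020-01-01", periods=60, freq="MS")
y = pd.Series(np.linspace(100, 160, 60) + rng.normal(0, 3, 60), index=idx)

model = ARIMA(y, order=(1, 1, 1)).fit()  # AR(1), first difference, MA(1)
print(model.forecast(steps=6))           # forecast the next 6 months
```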


Time Series Helpers in NumPy

Bala Priya C shares some one-liners:

NumPy’s array operations can help simplify most common time series operations. Instead of thinking step-by-step through data transformations, you can apply vectorized operations that process entire datasets at once.

This article covers 10 NumPy one-liners that can be used for time series analysis tasks you’ll come across often. Let’s get started!

Click through to see the ten in action.
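In the same spirit (my examples, not necessarily among Bala's ten), a few vectorized operations that replace explicit loops:

```python
# Vectorized time series operations in NumPy: returns, running totals,
# and a rolling mean, all without a Python-level loop.
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 104.0, 110.0])

returns = np.diff(prices) / prices[:-1]  # period-over-period returns
cumulative = np.cumsum(returns)          # running total of returns

# 3-period moving average via a sliding window view (NumPy >= 1.20)
windows = np.lib.stride_tricks.sliding_window_view(prices, 3)
rolling_mean = windows.mean(axis=1)

print(returns.round(4), cumulative.round(4), rolling_mean, sep="\n")
```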


Tips for Working with Pandas

Matthew Mayo has a few tips when working with Pandas for data preparation:

If you’re reading this, it’s likely that you are already aware that the performance of a machine learning model is not just a function of the chosen algorithm. It is also highly influenced by the quality and representation of the data that said model has been trained on.

Data preprocessing and feature engineering are some of the most important steps in your machine learning workflow. In the Python ecosystem, Pandas is the go-to library for these types of data manipulation tasks, something you also likely know. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better performing models.

This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.

Click through for those tips and tricks.
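For flavor, here are two common transformations in that vein (my examples, not necessarily among Matthew's seven): one-hot encoding a categorical column and binning a continuous feature.

```python
# Two everyday Pandas feature engineering moves: encoding and binning.
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF"],
    "age": [25, 41, 33, 58],
    "income": [70_000, 95_000, 82_000, 120_000],
})

# One-hot encode a categorical column
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Bin a continuous feature into labeled buckets
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                        labels=["young", "mid", "senior"])

print(df)
```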


A Primer on Bayesian Modeling

Hristo Hristov is speaking my language:

Multivariate analysis in data science is a type of analysis that tackles multiple input/predictor and output/predicted variables. This tip explores the problem of predicting air pollution measured in particulate matter (PM) concentration based on ambient temperature, humidity, and pressure using a Bayesian Model.

Click through for a detailed code sample and explanation.
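As a hedged sketch of how such a model might be set up (PyMC is assumed here, and the data below is synthetic; Hristo's tip has the full, authoritative code):

```python
# A sketch of Bayesian multivariate regression in PyMC: PM concentration
# modeled from temperature, humidity, and pressure, on synthetic data.
import numpy as np
import pymc as pm

rng = np.random.default_rng(3)
n = 200
temp, humidity, pressure = rng.normal(size=(3, n))
pm25 = 20 + 3 * temp - 2 * humidity + 1.5 * pressure + rng.normal(0, 2, n)

with pm.Model():
    # Weakly informative priors over coefficients and noise scale
    beta = pm.Normal("beta", mu=0, sigma=10, shape=3)
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)

    mu = intercept + beta[0] * temp + beta[1] * humidity + beta[2] * pressure
    pm.Normal("pm25", mu=mu, sigma=sigma, observed=pm25)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=3)

# Posterior means of the three coefficients
print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)
```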
