
Category: Data Science

Simple Data Cleanup with Pandas

Ivan Palomares Carrascosa builds a process:

Few data science projects are exempt from the necessity of cleaning data. Data cleaning encompasses the initial steps of preparing data. Its specific purpose is that only the relevant and useful information underlying the data is retained, be it for its posterior analysis, to use as inputs to an AI or machine learning model, and so on. Unifying or converting data types, dealing with missing values, eliminating noisy values stemming from erroneous measurements, and removing duplicates are some examples of typical processes within the data cleaning stage.

As you might think, the more complex the data, the more intricate, tedious, and time-consuming the data cleaning can become, especially when implementing it manually.

Ivan handles some of the most common types of data cleaning work and shows a simple way of implementing each of them.
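To make the steps in the excerpt concrete, here is a minimal sketch of three of them (unifying types, removing duplicates, filling missing values) in pandas. The data frame, column names, and the choice of a median fill are my own assumptions for illustration, not taken from Ivan's post.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, one missing value, one duplicate row.
raw = pd.DataFrame({
    "city": ["Lima", "Oslo", "Oslo", "Cairo"],
    "temp_c": ["21.5", "4.0", "4.0", None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Unify types: coerce text to numbers; unparseable or missing entries become NaN.
    out["temp_c"] = pd.to_numeric(out["temp_c"], errors="coerce")
    # Remove exact duplicate rows, keeping the first occurrence.
    out = out.drop_duplicates()
    # Fill the remaining missing values with the column median (an assumed policy).
    out["temp_c"] = out["temp_c"].fillna(out["temp_c"].median())
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Whether a median fill is appropriate depends entirely on the data; dropping the row or flagging it may be the better call in other cases.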

Comments closed

Building a GitHub Codespace Configuration for Polyglot Notebooks

Matt Eland makes some recommendations:

In order to get Polyglot Notebooks to work with GitHub Codespaces, you’ll need to match the current requirements of the Polyglot Notebooks extension and its underlying .NET Interactive kernels.

This relies on two files in your .devcontainer directory:

  • Dockerfile which describes the Docker container the Codespace will run in
  • devcontainer.json which describes how the dev container is configured in terms of extensions and ports

Read on to learn more. Also, Matt has a brand new book available on the topic of polyglot notebooks, so check that out.


The Importance of Versioning Data

John Mount demonstrates an important concept:

Our business goal is to build a model relating attendance to popcorn sales, which we will apply to future data in order to predict future popcorn sales. This allows us to plan staffing and purchasing, and also to predict snack bar revenue.

In the above example data, all dates in August of 2024 are “in the past” (available as training and test/validation data) and all dates in September of 2024 are “in the future” (dates we want to make predictions for). The movie attendance service we are subscribing to supplies:

  • past schedules
  • past (recorded) attendance
  • future schedules, and
  • (estimated) future attendance.

John’s example scenario covers the problem of future estimates interfering with model quality. Another important scenario is when the past changes. As one example, digital marketing providers (think Google, Bing, Amazon, etc.) will provide you with impression and click data pretty quickly, and each day they close the books on a prior day’s data at some normal time. For some of these providers, that prior day’s data is yesterday’s data: on Tuesday, provider X closes the books on Monday’s data and promises that it won’t change after that. Other providers, though, might keep revising data over the course of the next 10 days. This means that the data you’re using for model training might change out from under you, and you might never know unless you keep track of the actual data you used at the time of training.
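One lightweight way to notice that the past has changed is to store a fingerprint of the training data next to the model. The sketch below is only an illustration using the standard library; the `fingerprint` helper and the sample rows are hypothetical, and it is not the approach John describes in his post.

```python
import hashlib
import json

def fingerprint(rows):
    """Return a SHA-256 hex digest of a list of dict rows, ignoring row and key order."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()

# Hypothetical click data as first delivered by a provider...
first_pull = [
    {"date": "2024-08-05", "impressions": 1000, "clicks": 40},
    {"date": "2024-08-06", "impressions": 1200, "clicks": 55},
]
# ...and the same date range pulled again after the provider revised one day.
later_pull = [
    {"date": "2024-08-05", "impressions": 1000, "clicks": 43},
    {"date": "2024-08-06", "impressions": 1200, "clicks": 55},
]

trained_on = fingerprint(first_pull)  # store this alongside the trained model
data_changed = fingerprint(later_pull) != trained_on
print(data_changed)
```

A hash only tells you that something changed, not what. To reproduce a training run you still need to keep the snapshot itself.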


Dealing with Collinearity using Lasso Regression

Vinod Chugani always moves in the same direction:

One of the significant challenges statisticians and data scientists face is multicollinearity, particularly its most severe form, perfect multicollinearity. This issue often lurks undetected in large datasets with many features, potentially disguising itself and skewing the results of statistical models.

In this post, we explore the methods for detecting, addressing, and refining models affected by perfect multicollinearity. Through practical analysis and examples, we aim to equip you with the tools necessary to enhance your models’ robustness and interpretability, ensuring that they deliver reliable insights and accurate predictions.

Read on to learn a bit more about how collinearity works and how you can use lasso regression (instead of ridge regression) to deal with the problem.
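As a rough illustration of why lasso helps here, the sketch below runs a bare-bones coordinate-descent lasso on two perfectly collinear columns. This is my own toy implementation with made-up data, no intercept, and an arbitrary penalty of 0.1, not the code from Vinod's post (which I would expect to lean on a library instead).

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; anything inside the band becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, sweeps=100):
    """Minimize (1/2n)*||y - Xb||^2 + alpha*||b||_1 by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Average product of feature j with the residual that leaves feature j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * coef[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coef[j] = soft_threshold(rho, alpha) / scale
    return coef

# Hypothetical data: the second column is an exact copy of the first.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, v] for v in x]
y = [3.0 * v for v in x]

coef = lasso_coordinate_descent(X, y, alpha=0.1)
print(coef)  # roughly [2.95, 0.0]: one column carries the effect, the copy is zeroed out
```

With exact duplicates the lasso solution is not unique, so which column survives depends on the update order. Ridge regression, by contrast, would typically shrink both coefficients and share the weight between the two copies rather than dropping one.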


Sampling without Replacement and Unequal Probabilities

Peter Ellis finds interesting results with sampling in R:

A week ago I was surprised to read on Thomas Lumley’s Biased and Inefficient blog that when using R’s sample() function without replacement and with unequal probabilities of individual units being sampled:

“What R currently has is sequential sampling: if you give it a set of priorities w it will sample an element with probability proportional to w from the population, remove it from the population, then sample with probability proportional to w from the remaining elements, and so on. This is useful, but a lot of people don’t realise that the probability of element i being sampled is not proportional to w_i”

Read on for a demonstration. H/T R-Bloggers.
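The effect is easy to check by hand on a tiny case. The sketch below is in Python rather than R and is not Peter's code: it enumerates every ordered draw under the sequential scheme described in the quote and adds up the exact inclusion probabilities. The weights 1, 2, 3 and the sample size of 2 are assumptions picked for illustration.

```python
from fractions import Fraction
from itertools import permutations

def inclusion_probabilities(weights, size):
    """Exact chance that each element lands in a sequential weighted sample without replacement."""
    n = len(weights)
    included = [Fraction(0)] * n
    for order in permutations(range(n), size):
        remaining = sum(weights)
        prob = Fraction(1)
        for i in order:
            # Draw element i with probability proportional to its weight among those left.
            prob *= Fraction(weights[i], remaining)
            remaining -= weights[i]
        for i in order:
            included[i] += prob
    return included

weights = [1, 2, 3]  # hypothetical priorities
probs = inclusion_probabilities(weights, size=2)
print(probs)  # [Fraction(5, 12), Fraction(11, 15), Fraction(17, 20)]
```

If inclusion were proportional to the weights, the three values would be 1/3, 2/3, and 1. Instead the lightest element is included more often than that and the heaviest less often.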


Explaining a Causal Forest

Michael Mayer wants to suss out the effects of inputs into a causal forest model:

We use a causal forest [1] to model the treatment effect in a randomized controlled clinical trial. Then, we explain this black-box model with usual explainability tools. These will reveal segments where the treatment works better or worse, just like a forest plot, but multivariately.

Read on for the example, as well as several mechanisms you can use to gauge feature relevance.


Random Forest Missing Data Imputation using missRanger

Michael Mayer handles missing data:

{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.

This looks like an interesting package. At first, I thought it was a way of generating predictions outside the boundaries of training data and had concerns—a classic point (limitation?) of random forest as an algorithm is that it will not even try to predict values outside the range of what it sees in training data, so if the largest label is 10 and the smallest is 0, you won’t see a prediction of 11 or 50, no matter how you scale the inputs.

Instead of doing that, missRanger looks like it’s filling in missing data using a clever approach. That’s quite useful for dealing with incomplete data, a really common problem whose good solutions tend to be complex enough that people typically ignore them in favor of simple but less useful solutions like dropping rows altogether.


Interpreting Linear Regression Model Coefficients

Vinod Chugani looks at a linear regression:

Linear regression models are foundational in machine learning. Merely fitting a straight line and reading the coefficient tells a lot. But how do we extract and interpret the coefficients from these models to understand their impact on predicted outcomes? This post will demonstrate how one can interpret coefficients by exploring various scenarios. We’ll delve into the analysis of a single numerical feature, investigate the role of categorical variables, and unpack the complexities introduced when these features are combined. Through this exploration, we aim to equip you with the skills needed to leverage linear regression models effectively, enhancing your analytical capabilities across different data-driven domains.

Click through for details, with examples in Python.
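For the simplest case in the excerpt, a single numerical feature, the interpretation can be sketched in a few lines of plain Python. The `fit_line` helper and the housing numbers are invented for illustration and are not taken from Vinod's post.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: living area in square feet vs. sale price in dollars.
area = [1000, 1500, 2000, 2500]
price = [150000, 200000, 250000, 300000]

intercept, slope = fit_line(area, price)
print(intercept, slope)  # 50000.0 100.0
```

Here the slope reads as "each additional square foot is associated with about $100 more in predicted price," and the intercept is the predicted price at zero square feet, which is rarely meaningful on its own.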


Time Series Anomaly Detection in Microsoft Fabric

Adi Eldar talks anomaly detection:

Anomaly Detector, one of Azure AI services, enables you to monitor and detect anomalies in your time series data. This service is based on advanced algorithms, SR-CNN for univariate analysis and MTAD-GAT for multivariate analysis. This service is being retired by October 2026, and as part of the migration process

  • The algorithms were open sourced and published by the new time-series-anomaly-detector package on PyPI.
  • We offer a time series anomaly detection workflow in Microsoft Fabric data platform.

Read on to see what replacements exist and how you can use the time-series-anomaly-detector package in Microsoft Fabric.


A Primer on One-Hot Encoding

Vinod Chugani does a bit of data modeling:

Preparing categorical data correctly is a fundamental step in machine learning, particularly when using linear models. One Hot Encoding stands out as a key technique, enabling the transformation of categorical variables into a machine-understandable format. This post tells you why you cannot use a categorical variable directly and demonstrates the use of One Hot Encoding in our search for identifying the most predictive categorical features for linear regression.

Read the whole thing.
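If you have not seen the technique before, here is a minimal sketch of what one-hot encoding does, written in plain Python so nothing is hidden behind a library call. The `one_hot` helper, its `drop_first` option, and the color data are my own assumptions for illustration; in practice you would more likely reach for a library encoder.

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    if drop_first:
        # Drop one level to avoid perfect collinearity with an intercept term.
        categories = categories[1:]
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature

columns, encoded = one_hot(colors)
print(columns)  # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

reduced_columns, reduced = one_hot(colors, drop_first=True)
print(reduced_columns)  # ['green', 'red']
```

With `drop_first=True`, the dropped category ("blue" here) becomes the baseline, encoded as all zeros, and the remaining coefficients in a linear model are read relative to it.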
