Press "Enter" to skip to content

Category: Data Science

Supply Chain Analysis in R via planr

Matt Dancho shows off an R package:

Supply chain management is all about balancing supply and demand to ensure that inventory levels are optimized. Overestimating demand leads to excess stock, while underestimating it causes shortages. Accurate inventory projections allow businesses to plan ahead, make data-driven decisions, and avoid costly errors like over-buying inventory or getting into a stock-outage and having no inventory to meet demand.

Read on to learn more about the package and how it works. H/T R-Bloggers.

Comments closed

Smoothing Functions in R

Ivan Svetunkov puts on the forecasting hat:

I have been asked recently by a colleague of mine how to extract the variance from a model estimated using adam() function from the smooth package in R. The problem was that that person started reading the source code of the forecast.adam() and got lost between the lines (this happens to me as well sometimes). Well, there is an easier solution, and in this post I want to summarise several methods that I have implemented in the smooth package for forecasting functions. In this post I will focus on the adam() function, although all of them work for es() and msarima() as well, and some of them work for other functions (at least as for now, for smooth v4.1.0). Also, some of them are mentioned in the Cheat sheet for adam() function of my monograph (available online).

Read on to learn more. H/T R-Bloggers.

Comments closed

An Overview of k Nearest Neighbors

Harris Amjad explains a common algorithm for classification:

It so happens that given the hype of Machine Learning (ML) and especially Large Language Models these days, there is a considerable proportion of those who wish to understand how these systems work from scratch. Unfortunately, more often than not, the interest fades away quickly as learners jump to complicated algorithms like neural networks and transformers first, without giving heed to traditional ML algorithms that paved the foundation for these advanced algorithms in the first place. In this tip, we will introduce and implement the K-Nearest Neighbors model in Python. Although it is quite old, it remains very popular due to its simplicity and intuitiveness.

Click through to learn more about this algorithm, including an implementation from scratch in Python.

Comments closed

An Overview of LightGBM

Vinod Chugani continues a series on tree-based classification techniques:

LightGBM is a highly efficient gradient boosting framework. It has gained traction for its speed and performance, particularly with large and complex datasets. Developed by Microsoft, this powerful algorithm is known for its unique ability to handle large volumes of data with significant ease compared to traditional methods.

In this post, we will experiment with LightGBM framework on the Ames Housing dataset. In particular, we will shed some light on its versatile boosting strategies—Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These strategies offer distinct advantages. Through this post, we will compare their performance and characteristics.

Read on to learn more about LightGBM as an algorithm, as well as how to use it.

Comments closed

Plotting the ROC Curve in Microsoft Fabric

Tomaz Kastrun gets plotting:

ROC (Receiver Operation Characteristics) – curve is a graph that shows how classifiers performs by plotting the true positive and false positive rates. It is used to evaluate the performance of binary classification models by illustrating the trade-off between True positive rate (TPR) and False positive rate (FPR) at various threshold settings.

Read on to see how you can generate one in a Microsoft Fabric notebook. Tomaz also plots a density function for additional fun.

Comments closed

Handling Missing Data with XGBoost

Vinod Chugani is missing a few data points:

XGBoost has gained widespread recognition for its impressive performance in numerous Kaggle competitions, making it a favored choice for tackling complex machine learning challenges. Known for its efficiency in handling large datasets, this powerful algorithm stands out for its practicality and effectiveness.

In this post, we will apply XGBoost to the Ames Housing dataset to demonstrate its unique capabilities. Building on our prior discussion of the Gradient Boosting Regressor (GBR), we will explore key features that differentiate XGBoost from GBR, including its advanced approach to managing missing values and categorical data.

Read on to see how it fares.

Comments closed

Boosting versus Bagging in Tree Models

Vinod Chugani compares two techniques for working with trees:

Ensemble learning techniques primarily fall into two categories: bagging and boosting. Bagging improves stability and accuracy by aggregating independent predictions, whereas boosting sequentially corrects the errors of prior models, improving their performance with each iteration. This post begins our deep dive into boosting, starting with the Gradient Boosting Regressor. Through its application on the Ames Housing Dataset, we will demonstrate how boosting uniquely enhances models, setting the stage for exploring various boosting techniques in upcoming posts.

Read on for more information. The neat part about the “boosting versus bagging” debate is that both techniques are quite useful. Although boosting (via algorithms like XGBoost or LightGBM) is the more popular technique, bagging (random forest) is extremely powerful in its own right.

Comments closed

Simple Outlier Detection and Removal in R

Steven Sanderson looks for oddities:

Outliers can significantly skew your data analysis results, leading to inaccurate conclusions. For R programmers, effectively identifying and removing outliers is crucial for maintaining data integrity. This guide will walk you through various methods to handle outliers in R, focusing on multiple columns, using a synthetic dataset for demonstration.

The techniques Steven uses are perfectly reasonable (though I like to use MAD from the median rather than standard deviations from the mean because MAD from the median doesn’t suffer from the sorts of endogeneity problems standard deviation does in a dynamic process). My primary warning would be to keep outliers in a dataset unless you know why you’re removing them. If you know the values were impossible or wrong—for example, a person who works 500 hours a week—that’s one thing. But sometimes, you get exceptional values out of an ordinary process, and those values are just as real as any other. I might have had a sequence in which I flipped a fair coin and it landed on heads 10 times in a row. It’s statistically very uncommon, but that doesn’t mean you can ignore it as a possibility or pretend it didn’t happen.

Comments closed

Fun with the Cauchy Distribution

John Cook checks out fat tails:

Someone with no exposure to probability or statistics likely has an intuitive sense that averaging random variables reduces variance, though they wouldn’t state it in those terms. They might, for example, agree that the average of several test grades gives a better assessment of a student than a single test grade. But data from a Cauchy distribution doesn’t behave this way.

This is one of my favorite distributions to mess with, though I’m glad few things naturally follow a Lorentzian. I had a colleague of mine build an Excel spreadsheet to describe the behavior, showing that results never converge and you could easily pull values 50+ “standard deviations” from the “mean.” I use scare quotes there because neither concept actually exists for the distribution, but if you mistake it for an awkwardly thin normal distribution, it can lead to some interesting results.

Comments closed

Natural Language Pre-Processing with Python

Harris Amjad does some text cleanup:

Natural Language Processing (NLP) is currently all the rage in the current machine learning landscape. With technologies like ChatGPT, Gemini, Llama, and so many other state-of-the-art text generators getting popular with the mainstream public, many newcomers are pouring into the field of NLP. Unfortunately, before we delve into how these fancy chatbots work, we must understand how we are engineering and treating our data before we feed it to our model. In this tip, we will introduce and implement some basic text preprocessing and cleaning techniques with Python.

Click through for some common operations. Some of these are very important for certain tasks but likely unhelpful for others. That could include things like lower-casing all words or removing stopwords. There are also some operations like spell checking and jargon expansion (or replacement) that you will likely want to include in a real-life project with actual people entering the data, versus a tidy sample dataset.

Comments closed