Data Science – Page 10

Cross-Correlation of Time Series to Identify Time Lags in SAS

Published 2024-05-28 by Kevin Feasel

Kevin Scott and David Frede notice the pattern:

Batch manufacturing involves producing goods in batches rather than in a continuous stream. This approach is common in industries such as pharmaceuticals, chemicals, and materials processing, where precise control over the production process is essential to ensure product quality and consistency. One critical aspect of batch manufacturing is the need to manage and understand inherent time delays that occur at various stages of the process.

In the glass manufacturing industry, which operates under the principles of batch manufacturing, precisely controlling the furnace temperature is essential for producing high-quality glass. The process involves melting raw materials like silica sand, soda ash, and limestone at high temperatures, where maintaining the correct temperature is crucial.

Read on to see an example of how you can automate the identification of a time lag using cross-correlation techniques.

Comments closed

An Overview of Classification Algorithms

Published 2024-05-24 by Kevin Feasel

Matthew Mayo explains several algorithms:

Classification algorithms are at the heart of data science, helping us categorize and organize data into pre-defined classes. These algorithms are used in a wide array of applications, from spam detection and medical diagnosis to image recognition and customer profiling. It is for this reason that those new to data science must know about and understand these algorithms: they lay foundations for more advanced techniques and provide insight into how those data-driven decisions are made.

Let’s take a look at 5 essential classification algorithms, explained intuitively. We will include resources for each to learn more if interested.

Click through for five algorithms and a couple of paragraphs describing how the algorithm works. For a little bit of self-promotion on my end, I have a series on YouTube running right now on the topic of classification where I cover a variety of algorithms. As a spoiler, 4 of the 5 on Matthew’s list will have their own videos, and there are several other algorithms to boot.

Comments closed

Accuracy is Not Enough for Classification

Published 2024-05-22 by Kevin Feasel

I have a new video:

In this video, I explain why accuracy is not the be-all, end-all measure for classification. After that, I introduce the confusion matrix, a mechanism for tracking predicted versus actual values. Then, I talk about a variety of measures and how we can derive them from the confusion matrix.

The trickiest part of the confusion matrix measures is just remembering which measures comport to which combinations in the matrix. The second-trickiest part of the confusion matrix is that R and Python invert them, so reading across the top row in R is equivalent to reading down the first column in Python.

Comments closed

Gradient Boosting for Classification

Published 2024-05-15 by Kevin Feasel

I have a new video:

In this video, I take a look at an alternative to bootstrap aggregation & random forest: boosting. We cover a brief history of boosting and see how it works in action with XGBoost and LightGBM.

This is probably the video with the single largest number of links in my show notes. It’s also one of the shortest in the series; it’s funny how things work out sometimes.

Comments closed

Quick Takes on Logistic Regression

Published 2024-05-10 by Kevin Feasel

John Cook talks about my favorite form of regression that serves to solve classification problems:

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

Read on for a few more thoughts on and around logistic regression and logits from a mathematician.

Comments closed

Classification with Random Forest

Published 2024-05-09 by Kevin Feasel

I have a new video:

In this video, I cover a powerful ensemble method for classification: random forests. We get an idea of how this differs from CART, learn the best possible metaphor for random forests, and dig into random search for hyperparameter optimization.

Click through to see the video in all its glory.

Comments closed

Model Selection with AIC

Published 2024-05-07 by Kevin Feasel

Steven Sanderson talks about the Akaike Information Criterion:

In the world of data analysis and statistics, one of the key challenges is selecting the best model to describe and analyze your data. This decision is crucial because it impacts the accuracy and reliability of your results. Among the many tools available, the Akaike Information Criterion (AIC) stands out as a powerful method for comparing different models and choosing the most suitable one.

Today we will go through an example of model selection using the AIC, specifically focusing on its application to various statistical distributions available in the TidyDensity package. TidyDensity, a part of the healthyverse ecosystem, offers a comprehensive suite of tools for data analysis in R, including functions to compute AIC scores for different probability distributions.

Read on for a quick primer on the AIC itself and how you can use it in TidyDensity.

Comments closed

MCMC Sampling with TidyDensity

Published 2024-05-06 by Kevin Feasel

Steven Sanderson performs some sampling:

In the area of statistical modeling and Bayesian inference, Markov Chain Monte Carlo (MCMC) methods are indispensable tools for tackling complex problems. The new tidy_mcmc_sampling() function in the TidyDensity R package simplifies MCMC sampling and visualization, making it accessible to a broader audience of data enthusiasts and analysts.

Read on for a brief primer on MCMC and an example of how the tidy_mcmc_sampling() function works.

Comments closed

Quantile Normalization with TidyDensity

Published 2024-05-02 by Kevin Feasel

Steven Sanderson achieves normality:

In data analysis, especially when dealing with multiple samples or distributions, ensuring comparability and removing biases is crucial. One powerful technique for achieving this is quantile normalization. This method aligns the distributions of values across different samples, making them more similar in terms of their statistical properties.

Read on to see how you can use the TidyDensity package to pull this off.

Comments closed

Classification Concepts and CART in Action

Published 2024-05-01 by Kevin Feasel

I have a new video series:

In this video, I explain some core concepts behind classification and introduce the first classification algorithm we will look at in CART.

CART, by the way, stands for Classification and Regression Trees, and is one of the easiest classification algorithms to understand as a concept: it’s a decision tree (aka, a series of if-else statements) where each terminal node is an outcome: either a class for classification or a value for regression.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data Science