Data Science – Page 15

Model Selection with AIC

Published 2024-05-07 by Kevin Feasel

Steven Sanderson talks about the Akaike Information Criterion:

In the world of data analysis and statistics, one of the key challenges is selecting the best model to describe and analyze your data. This decision is crucial because it impacts the accuracy and reliability of your results. Among the many tools available, the Akaike Information Criterion (AIC) stands out as a powerful method for comparing different models and choosing the most suitable one.

Today we will go through an example of model selection using the AIC, specifically focusing on its application to various statistical distributions available in the TidyDensity package. TidyDensity, a part of the healthyverse ecosystem, offers a comprehensive suite of tools for data analysis in R, including functions to compute AIC scores for different probability distributions.

Read on for a quick primer on the AIC itself and how you can use it in TidyDensity.
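
As a rough illustration of the idea (in Python rather than R, and not using TidyDensity at all), the sketch below computes AIC = 2k - 2 ln(L) for two candidate distributions fitted by maximum likelihood to simulated data. The function names and the simulated sample are assumptions made for this example only.

```python
import math
import random


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def normal_loglik(data):
    """Maximized log-likelihood of a normal fit (2 parameters: mean, variance)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # MLE divides by n, not n - 1
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def exponential_loglik(data):
    """Maximized log-likelihood of an exponential fit (1 parameter: rate)."""
    n = len(data)
    rate = n / sum(data)  # MLE of the rate is 1 / sample mean
    return n * math.log(rate) - rate * sum(data)


# Simulated data that really is exponential (an assumption for the demo).
rng = random.Random(42)
data = [rng.expovariate(0.5) for _ in range(500)]

scores = {
    "normal": aic(normal_loglik(data), n_params=2),
    "exponential": aic(exponential_loglik(data), n_params=1),
}
best = min(scores, key=scores.get)
print(scores, "-> preferred:", best)
```

The model with the lowest score is preferred. The 2k term penalizes extra parameters, so a more flexible distribution has to earn its place with a better fit.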

MCMC Sampling with TidyDensity

Published 2024-05-06 by Kevin Feasel

Steven Sanderson performs some sampling:

In the area of statistical modeling and Bayesian inference, Markov Chain Monte Carlo (MCMC) methods are indispensable tools for tackling complex problems. The new tidy_mcmc_sampling() function in the TidyDensity R package simplifies MCMC sampling and visualization, making it accessible to a broader audience of data enthusiasts and analysts.

Read on for a brief primer on MCMC and an example of how the tidy_mcmc_sampling() function works.
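
For a feel of what MCMC does in general, here is a minimal random-walk Metropolis sampler in Python. It is a generic sketch and makes no claim about how tidy_mcmc_sampling() works internally; the target density, step size, and burn-in length are all assumptions chosen for the demo.

```python
import math
import random


def metropolis(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose a nearby point, accept it with
    probability min(1, target(proposal) / target(current))."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        log_ratio = proposal_lp - current_lp
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected proposal repeats the current value
    return samples


def log_normal_3_1(x):
    """Unnormalized log-density of a normal with mean 3 and sd 1."""
    return -0.5 * (x - 3.0) ** 2


draws = metropolis(log_normal_3_1, start=0.0, n_samples=20000)
kept = draws[1000:]  # discard an arbitrary burn-in period
sample_mean = sum(kept) / len(kept)
print(round(sample_mean, 2))
```

The sampler only needs the target density up to a constant, which is what makes the approach useful for Bayesian posteriors.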

Quantile Normalization with TidyDensity

Published 2024-05-02 by Kevin Feasel

Steven Sanderson achieves normality:

In data analysis, especially when dealing with multiple samples or distributions, ensuring comparability and removing biases is crucial. One powerful technique for achieving this is quantile normalization. This method aligns the distributions of values across different samples, making them more similar in terms of their statistical properties.

Read on to see how you can use the TidyDensity package to pull this off.

Classification Concepts and CART in Action

Published 2024-05-01 by Kevin Feasel

I have a new video series:

In this video, I explain some core concepts behind classification and introduce CART, the first classification algorithm we will look at.

CART, by the way, stands for Classification and Regression Trees, and is one of the easiest classification algorithms to understand as a concept: it’s a decision tree (aka, a series of if-else statements) where each terminal node is an outcome: either a class for classification or a value for regression.
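
To make the "series of if-else statements" point concrete, here is a small Python sketch that learns a one-level tree (a stump) by picking the threshold that minimizes weighted Gini impurity. The data, labels, and function names are made up for illustration; a real CART implementation recurses on each side and handles many features.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini."""
    best_threshold, best_score = None, float("inf")
    ordered = sorted(set(values))
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2  # candidate midway between neighbors
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


def predict(hours, threshold):
    """The learned tree is literally an if-else on the split threshold."""
    if hours <= threshold:
        return "fail"
    return "pass"


# Hypothetical toy data: hours studied and exam outcome.
hours_studied = [1, 2, 3, 6, 7, 8]
outcome = ["fail", "fail", "fail", "pass", "pass", "pass"]

threshold, score = best_split(hours_studied, outcome)
print(threshold, score, predict(5, threshold))
```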

Specifying Follow-Up Times for Longitudinal Data in simstudy

Published 2024-04-17 by Kevin Feasel

Keith Goldfeld updates the simstudy package:

A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are {0,1,2,…,n−1}, where there are n total periods. However, when follow-up periods required more specificity, such as {0,90,180,365} days from baseline, users had to manually add them. Originally, I had intended to incorporate this feature into the function, but unfortunately it slipped through the cracks. Thanks to the clear motivation provided by the researcher, I’ve implemented this enhancement. Users can now replace the default vector with their desired set of follow-up periods using the new argument periodVec. This addition is available in the development version of simstudy on GitHub.

Read on to see how it works. H/T R-Bloggers.

Estimating Chi-Square Parameters with R

Published 2024-04-16 by Kevin Feasel

Steven Sanderson performs a test:

In the world of statistics and data analysis, understanding and accurately estimating the parameters of probability distributions is crucial. One such distribution is the chi-square distribution, often encountered in various statistical analyses. In this blog post, we’ll dive into how we can estimate the degrees of freedom (“df”) and the non-centrality parameter (“ncp”) of a chi-square distribution using the R programming language.

Read on to learn more about the process of estimation while I grumble something about Bayesian analysis being better.
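
One simple way to see what estimating df and ncp involves is the method of moments, sketched here in Python. This is an assumption for illustration and not necessarily the approach the linked post takes. It relies on the known moments of the non-central chi-square distribution: mean = df + ncp and variance = 2(df + 2 ncp).

```python
import math
import random
from statistics import fmean, pvariance


def moments_to_params(mean, variance):
    """Solve mean = df + ncp and variance = 2 * (df + 2 * ncp) for (df, ncp)."""
    ncp = max(variance / 2 - mean, 0.0)  # clamp: ncp cannot be negative
    df = mean - ncp
    return df, ncp


def simulate_noncentral_chisq(df, ncp, n, seed=1):
    """Sum of df squared unit-variance normals, one of them shifted by
    sqrt(ncp). Only valid for integer df >= 1."""
    rng = random.Random(seed)
    shift = math.sqrt(ncp)
    draws = []
    for _ in range(n):
        first = (rng.gauss(0.0, 1.0) + shift) ** 2
        rest = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df - 1))
        draws.append(first + rest)
    return draws


# Simulated sample with known parameters, so the estimates can be sanity-checked.
sample = simulate_noncentral_chisq(df=5, ncp=2.0, n=20000)
df_hat, ncp_hat = moments_to_params(fmean(sample), pvariance(sample))
print(round(df_hat, 2), round(ncp_hat, 2))
```

Moment estimates are quick but noisy, because they lean on the sample variance. Likelihood-based fitting is usually the more efficient choice when the sample is small.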

Multidimensional Scaling in R

Published 2024-04-05 by Kevin Feasel

Steven Sanderson is from the 5th dimension:

Visualizing similarities between data points can be tricky, especially when dealing with many features. This is where multidimensional scaling (MDS) comes in handy. It allows us to explore these relationships in a lower-dimensional space, typically 2D or 3D for easier interpretation. In R, the cmdscale() function from base R is a great tool for performing classical MDS.

Click through to see how this works. In case you’re curious, cmdscale() is an example of principal coordinates analysis. If you’re familiar with principal components analysis, that’s a different form of multidimensional scaling.
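
For the curious, classical MDS boils down to three steps: square the distances, double-center the result, and take the top eigenvectors scaled by the square roots of their eigenvalues. The pure-Python sketch below follows those steps with simple power iteration. It is an illustration of the technique under those assumptions, not a port of cmdscale(), and the point set is invented.

```python
import math
import random


def classical_mds(dist, k=2, iters=500, seed=0):
    """Classical MDS (principal coordinates analysis) on a symmetric
    distance matrix, returning n points in k dimensions."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (row means equal column means
    # because the matrix is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):  # power iteration for the dominant eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Hypothetical points: corners of a 4 x 1 rectangle.
points = [(0, 0), (4, 0), (4, 1), (0, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
recovered = [[math.dist(p, q) for q in coords] for p in coords]
```

The recovered coordinates may be rotated or reflected relative to the originals, but the pairwise distances are preserved, which is the whole point of the technique.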

Normalizing Data in R

Published 2024-04-03 by Kevin Feasel

Steven Sanderson says, act normal:

Data normalization is a crucial preprocessing step in data analysis and machine learning workflows. It helps in standardizing the scale of numeric features, ensuring fair treatment to all variables regardless of their magnitude. In this tutorial, we’ll explore how to normalize data in R using practical examples and step-by-step explanations.

Read on for a definition of what this means and how you can do it.
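
Two common ways to put a numeric feature on a standard scale are min-max scaling and z-scores. The Python sketch below shows both; which one the linked post uses is not stated here, so treat the choice and the function names as assumptions.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Rescale to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation.
    (R's scale() divides by the sample standard deviation instead.)"""
    mu = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


print(min_max_scale([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```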

Quantile Normalization in R

Published 2024-04-01 by Kevin Feasel

Steven Sanderson has achieved normality:

Before we dive into the code, let’s understand the concept behind quantile normalization. At its core, quantile normalization aims to equalize the distributions of multiple datasets by aligning their quantiles. This ensures that each dataset has the same distribution of values, making meaningful comparisons possible.

This is a bit different from normalizing individual data points in one dataset, as you can see in the post.
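
A minimal Python sketch of the idea: sort each dataset, average the values at each rank across datasets, then give every observation the average for its rank. It assumes equal-length datasets with no ties, and the numbers are invented for the example.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (assumes no tied values)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the r-th smallest value across all columns.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result


a = [5, 2, 3]
b = [40, 10, 60]
normalized = quantile_normalize([a, b])
print(normalized)
```

After normalization both columns contain exactly the same set of values, just in different positions, so the within-dataset ordering survives while the distributions match.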

A Bayesian Approach to CAPTCHAs

Published 2024-03-29 by Kevin Feasel

John Cook claims to be human:

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

Read on to see how a Bayesian approach to the problem could make things a bit less annoying.
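
To give a flavor of what a Bayesian take might look like, the Python sketch below updates the probability that a user is human after each answer instead of demanding a perfect score. Every number in it is an assumption for illustration (a 50/50 prior, a human who is right 90% of the time, a bot guessing among three clips), and it may differ from what the linked post proposes.

```python
def update_p_human(prior, answered_correctly,
                   p_correct_human=0.9, p_correct_bot=1 / 3):
    """One step of Bayes' rule on P(human) given a single answer."""
    like_human = p_correct_human if answered_correctly else 1 - p_correct_human
    like_bot = p_correct_bot if answered_correctly else 1 - p_correct_bot
    numerator = prior * like_human
    return numerator / (numerator + (1 - prior) * like_bot)


def posterior_after(answers, prior=0.5):
    """Apply the update for each answer in turn."""
    p = prior
    for ok in answers:
        p = update_p_human(p, ok)
    return p


# 19 correct answers and one miss, the case that forces a restart
# under an all-or-nothing rule.
answers = [True] * 19 + [False]
print(posterior_after(answers))
```

Under these assumed numbers, a single miss barely dents the posterior after a long run of correct answers, and a test could stop early once the probability crosses a chosen threshold.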

Category: Data Science