Data Science – Page 19

Building a Multinomial Distribution in R

Published 2023-11-02 by Kevin Feasel

Steven Sanderson isn’t satisfied with the binomial:

The multinomial distribution is a probability distribution that describes the probability of obtaining a specific number of counts for k different outcomes, when each outcome has a fixed probability of occurring.

In R, we can use the rmultinom() function to simulate random samples from a multinomial distribution, and the dmultinom() function to calculate the probability of a specific outcome.

Click through to see how you can build a multinomial distribution and what the difference is between rmultinom() and dmultinom().

Comments closed

Uniform Random Number Generation in R

Published 2023-11-01 by Kevin Feasel

Steven Sanderson digs into the uniform distribution:

Randomness is an essential part of many statistical and machine learning tasks. In R, there are a number of functions that can be used to generate random numbers, but the runif() function is the most commonly used.

Something mildly embarrassing for me is that it took me a while to figure out why they call the command runif(). That’s because, at first, I didn’t pronounce it r unif but rather run if.

In reality, *unif() means “uniform distribution” and r stands for “random number.” There are several other functions based on the uniform distribution and Steven looks at those as well in this post.

Comments closed

An Analysis of Goal Line Runs out of Shotgun

Published 2023-10-31 by Kevin Feasel

I decided to test a common narrative:

A common theme among Buffalo Bills fans is the idea that the Bills run too many plays out of shotgun near the opposing team’s goal line, and this is hampering their ability to score points. Instead, these fans argue, they should run from under center, either a direct handoff or a quarterback sneak. If you were to press fans on this, I believe you’d also hear that the Bills are unique, or at least uniquely bad, at running such plays.

I’m going to use the nflfastR package to analyze play-by-play data and see just how well this bit of fan wisdom holds up.

Spoiler alert: it doesn’t.

Comments closed

Plotting Logistic Regression in R

Published 2023-10-30 by Kevin Feasel

Steven Sanderson performs a logistic regression:

Logistic regression is a statistical method used for predicting the probability of a binary outcome. It’s a fundamental tool in machine learning and statistics, often employed in various fields such as healthcare, finance, and marketing. We use logistic regression when we want to understand the relationship between one or more independent variables and a binary outcome, which can be “yes/no,” “1/0,” or any two-class distinction.

Click through to learn how to do this.

Comments closed

Building a Bland-Altman Plot in R

Published 2023-10-27 by Kevin Feasel

Steven Sanderson performs a comparison:

Before we dive into the code, let’s briefly understand what a Bland-Altman plot is. It’s a graphical method to visualize the agreement between two measurement techniques, often used in fields like medicine or any domain with comparative measurements. The plot displays the differences between two measurements (Y-axis) against their means (X-axis).

Click through to see how this works and how you can interpret the results.

Comments closed

Making a Time Series Stationary in R

Published 2023-10-20 by Kevin Feasel

Steven Sanderson puts a halt to things:

When working with time series data, one common challenge is dealing with non-stationary data. Non-stationary time series can be a headache for analysts, but fear not, because we have a handy tool to make your life easier. Say hello to the auto_stationarize() function from the {healthyR.ts} package.

Read on to learn why you want stationary data for time series analysis and how the auto_stationarize() function works.

Comments closed

Time Series Stationarity Testing in R

Published 2023-10-18 by Kevin Feasel

Steven Sanderson isn’t just spinning in place:

Before we delve into the ts_adf_test() function, let’s understand the concept behind it. The Augmented Dickey-Fuller (ADF) test is a crucial tool in time series analysis. It’s like the Sherlock Holmes of time series data, helping us detect whether a series is stationary or not. Stationarity is a fundamental assumption in time series modeling because many models work best when applied to stationary data.

So, why “Augmented”? Well, it’s an extension of the original Dickey-Fuller test that accounts for more complex relationships within the time series data.

Click through to see how you can use the ts_adf_test() function to get a better feel for whether a time series is stationary.

Comments closed

A Primer on A/B Testing for Engineers

Published 2023-10-18 by Kevin Feasel

John Mount performs some testing:

I’d like to discuss a simple variation of A/B testing in an engineering style.
By “an engineering style” I mean:

We will work a simulated example to see that the system works as claimed.

We will exhibit examples of problems before trying to fix them.

We will demonstrate all of the top level claims as calculations, and not delegate these to references.

We will leave fundamental math to the references, and not try to re-derive it.

In my opinion far too few A/B testing treatments check soundness, even on simulated data. This makes it easy for such articles to leave out important steps. If a relied on reference omits a step, the derived work may have to do the same.
We will implement the experiment design directly, instead of using a canned power calculator so we have a place to discuss some of the design issues in A/B test design.

This is an excellent dive into the topic and I highly recommend taking the time to read it.

Comments closed

New R Package: hstats

Published 2023-10-17 by Kevin Feasel

Michael Mayer has a new package:

The current version offers:

H statistics per feature, feature pair, and feature triple

multivariate predictions at no additional cost

a convenient API

other important tools from explainable ML:

performance calculations

permutation importance (e.g., to select features for calculating H-statistics)

partial dependence plots (including grouping, multivariate, multivariable)

individual conditional expectations (ICE)

Case-weights are available for all methods, which is important, e.g., in insurance applications.

Click through for an example of how it works, followed by some simple benchmarking to give you an idea of how it performs compared to similar tools.

Comments closed

Data Science in Microsoft Fabric

Published 2023-10-13 by Kevin Feasel

Reza Rad has a new video and article for us:

Since Microsoft Fabric is an end-to-end analytics service, it makes sense to have the means for data science, too. Here are tools, Fabric items, and libraries that help the Data Science to happen, or in other words, parts of the Data Science workload of Microsoft Fabric.

Reza also has plenty of links to other articles and videos for more detail.

Comments closed

Category: Data Science