Category: Data Science

Transferring Linear Model Coefficients

Nina Zumel performs a swap:

A quick glance through the scikit-learn documentation on linear models, or the CRAN task view on Mixed, Multilevel, and Hierarchical Models in R reveals a number of different procedures for fitting models with linear structure. Each of these procedures meets different needs and constraints, and some of them can be computationally intensive. But in the end, they all have the same underlying structure: outcome is modelled as a linear combination of input features.

But the existence of so many different algorithms, and their associated software, can obscure the fact that just because two models were fit differently, they don’t have to be run differently. The fitting implementation and the deployment implementation can be distinct. In this note, we’ll talk about transferring the coefficients of a linear model to a fresh model, without a full retraining.

I had a similar problem about 18 months ago, though a much easier one than Nina describes: I had access to the original data and simply needed to build a linear regression in Python that exactly matched one originally developed in R. It turns out that’s not as easy as you might think. The two languages make different default assumptions, so the results come out similar but not identical, and piecing all of this together took a bit of sleuthing.
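
As a rough illustration of keeping the fitting and deployment implementations separate, here is a minimal Python sketch that scores a row using coefficients fit elsewhere. The function name `predict_linear`, the feature names, and the coefficient values are all hypothetical and do not come from Nina's post.

```python
from typing import Mapping


def predict_linear(coefs: Mapping[str, float], intercept: float,
                   row: Mapping[str, float]) -> float:
    """Score one observation with coefficients exported from another tool."""
    missing = set(coefs) - set(row)
    if missing:
        # Fail loudly instead of silently treating a missing feature as zero.
        raise KeyError(f"row is missing features: {sorted(missing)}")
    return intercept + sum(beta * row[name] for name, beta in coefs.items())


# Hypothetical values, e.g. copied by hand from a model fit in R.
intercept = 1.5
coefs = {"age": 0.25, "income": -0.5}

print(predict_linear(coefs, intercept, {"age": 40.0, "income": 3.0}))  # prints 10.0
```

The predictions only match the original model if the deployment side encodes and scales the features the same way the fitting side did, which is where differing defaults tend to show up.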

A/B Testing with Survival Analysis in R

Iyar Lin combines two great flavors:

Usually when running an A/B test analysts assign users randomly to variants over time and measure conversion rate as the ratio between the number of conversions and the number of users in each variant. Users who just entered the test and those who are in the test for 2 weeks get the same weight.

This can be enough for cases where a conversion either happens or not within a short time frame after assignment to a variant (e.g. Finishing an on-boarding flow).

There are however many instances where conversions are spread over a longer time frame. One example would be first order after visiting a site landing page. Such conversions may happen within minutes, but a large chunk could also happen within days after the first visit.

Read on for the scenario, as well as a simulation. I will note that, in the digital marketing industry, there’s usually a hard cap on the number of days within which you can attribute a conversion to some action, for exactly the reason Iyar mentions. H/T R-Bloggers.
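
To make the idea concrete, here is a small Python sketch of a Kaplan-Meier estimate, a standard survival-analysis tool. It is not Iyar's R code, and the function name and toy data are assumptions for illustration. Users who have not converted yet are treated as censored rather than counted as failures.

```python
from collections import Counter


def kaplan_meier(durations, converted):
    """Return [(t, S(t))], where S(t) is the estimated share of users who
    have NOT converted by day t.

    durations[i]: days user i has been observed.
    converted[i]: True if the user converted on that day, False if the user
                  was still unconverted when observation stopped (censored).
    """
    events = Counter(t for t, c in zip(durations, converted) if c)
    exits = Counter(durations)  # every user leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data: six users in one variant, three of whom have converted so far.
durations = [1, 2, 2, 5, 7, 7]
converted = [True, False, True, True, False, False]

curve = kaplan_meier(durations, converted)
naive_rate = sum(converted) / len(converted)
km_rate_by_day_5 = 1 - dict(curve)[5]
print(naive_rate, km_rate_by_day_5)
```

In this toy data the user censored on day 2 drops out of the denominator for later days, so the estimated conversion rate by day 5 comes out higher than the naive ratio of conversions to users.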

Random Walks in R with TidyDensity

Steven Sanderson goes for a walk:

A random walk is a mathematical object that describes a path consisting of a succession of random steps. It’s a cornerstone concept in fields like physics, economics, and biology. In finance, for example, the random walk hypothesis suggests that stock market prices evolve according to a random walk and thus cannot be predicted.

Read on to see how you can generate a dataset matching a random walk, as well as a comparison of techniques for generating them.
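
For readers outside of R, the core idea fits in a few lines of Python. This is a generic sketch of a Gaussian random walk, not the TidyDensity API; the function name and parameters are assumptions.

```python
import random
from itertools import accumulate


def random_walk(n_steps, seed=None, start=0.0, mu=0.0, sd=1.0):
    """Return start plus the running sum of n_steps Gaussian steps."""
    rng = random.Random(seed)  # local generator so results are reproducible
    steps = (rng.gauss(mu, sd) for _ in range(n_steps))
    # accumulate(..., initial=start) yields n_steps + 1 positions.
    return list(accumulate(steps, initial=start))


walk = random_walk(100, seed=42)
print(len(walk), walk[0])  # prints 101 0.0
```

Each position is simply the previous position plus an independent random step, which is why past values say nothing about the direction of the next move.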

Tips for Choosing a Classifier

I’ve wrapped up yet another series:

In this video, I wrap up the series on classification and provide some quick-and-dirty tips on when to use each of the classification algorithms we have discussed.

This was a series I really enjoyed. I’ve had a talk on the topic for a few years, but getting the opportunity to dig in deeper and spend a few hours on the topic was nice. It also helped me fill in some gaps in my understanding and fix a few long-standing bugs in my demo code, so it’s got that going for it as well.

SHAP and Additive Models

Michael Mayer answers a pair of related questions:

Within only a few years, SHAP (Shapley additive explanations) has emerged as the number 1 way to investigate black-box models. The basic idea is to decompose model predictions into additive contributions of the features in a fair way. Studying decompositions of many predictions allows us to derive global properties of the model.

What happens if we apply SHAP algorithms to additive models? Why would this ever make sense?

Read on for the answers to these two questions.
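
One well-known property is easy to check by hand, though it may or may not be the angle Michael takes: for an additive model, the exact Shapley value of a feature (computed against a background dataset) equals that feature's own term minus the term's average over the background. The brute-force Python sketch below verifies this on a hypothetical two-feature model. It is a from-scratch calculation, not a call into any SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, averaging absent features over background rows."""
    p = len(x)

    def value(subset):
        # Features in `subset` come from x; the rest come from each background row.
        total = 0.0
        for b in background:
            total += f([x[j] if j in subset else b[j] for j in range(p)])
        return total / len(background)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        contribution = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for combo in combinations(others, size):
                s = set(combo)
                contribution += weight * (value(s | {j}) - value(s))
        phi.append(contribution)
    return phi


# Hypothetical additive model: f(x) = 2 * x0 + x1 ** 2
def f(x):
    return 2 * x[0] + x[1] ** 2


background = [[0.0, 1.0], [2.0, 3.0]]
x = [3.0, 2.0]
phi = shapley_values(f, x, background)
print(phi)
```

Here the first feature's term is 6 against a background average of 2, and the second feature's term is 4 against a background average of 5, so the values come out to 4 and -1, and they sum to the prediction minus the average background prediction.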

Random Walks and Brownian Motion in healthyR.ts

Steven Sanderson goes for a walk on the stock exchange:

In the world of time series analysis, Random Walks, Brownian Motion, and Geometric Brownian Motion are fundamental concepts used in various fields, including finance, physics, and biology. Today, we’ll explore these concepts using functions from the healthyR.ts package.

Click through to learn about each of these concepts and some examples of how you can generate time series datasets following each of them.
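
As a language-neutral companion, here is a short Python sketch of Geometric Brownian Motion using the usual log-normal update. It does not use or mirror the healthyR.ts functions; the function name, defaults (daily steps, 5% drift, 20% volatility), and starting price are assumptions.

```python
import math
import random


def geometric_brownian_motion(n_steps, s0=100.0, mu=0.05, sigma=0.2,
                              dt=1 / 252, seed=None):
    """Simulate one GBM path: S[t+1] = S[t] * exp((mu - sigma^2 / 2) dt + sigma sqrt(dt) Z)."""
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    path = [s0]
    for _ in range(n_steps):
        shock = rng.gauss(0.0, 1.0)  # standard normal increment
        path.append(path[-1] * math.exp(drift + vol * shock))
    return path


prices = geometric_brownian_motion(252, seed=123)
print(len(prices), min(prices) > 0)  # prints 253 True
```

Because each step multiplies the previous price by a positive factor, the simulated series can never cross below zero, which is one reason this process is a common choice for price-like data.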

New Video: Multi-Class Classification

I have a new video:

In this video, I get past two-class classification and explain how things differ in the multi-class world.

What’s really interesting is that, in many cases, when it comes to code, the answer is “not much.” That’s because libraries like scikit-learn do a lot to smooth over differences between two-class and multi-class classification. But there are still differences that can bite you if you don’t understand how the cases differ.
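
One example of such a difference, chosen here for illustration and not necessarily one covered in the video, is how per-class metrics get averaged. In the Python sketch below (function names and toy labels are assumptions), a model that always predicts the majority class looks fine under micro averaging and poor under macro averaging.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each label that appears in y_true."""
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    totals = Counter(y_true)
    return {label: hits[label] / totals[label] for label in totals}


def micro_recall(y_true, y_pred):
    """Pool all rows together; equals accuracy for single-label problems."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Average the per-class recalls so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy data: three classes, heavily imbalanced, and a model that always says "a".
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 10

print(micro_recall(y_true, y_pred))  # prints 0.8
print(round(macro_recall(y_true, y_pred), 3))  # prints 0.333
```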

An Introduction to the healthyR.ai Package

Steven Sanderson explains the purpose of a package:

The ultimate goal really is to make it easier to do data analysis and machine learning in R. The package is designed to be easy to use and to provide a wide range of functionality for data analysis. The package is also meant to help and provide some easy boilerplate functionality for machine learning. This package is in its early stages and will be updated frequently.

It also keeps with the same framework of all of the healthyverse packages in that it is meant for the user to be able to use the package without having to know a lot of R. Many rural hospitals do not have the resources to perform this sort of work, so I am working hard to build these types of things out for them for free.

Read on to see how it works, including several examples of the package in action.

Practical healthyR.ts Examples

Steven Sanderson provides some examples:

Today I am going to go over some quick yet practical examples of ways that you can use the healthyR.ts package. This package is designed to help you analyze time series data in a more efficient and effective manner.

Let’s just jump right into it!

Read on for a few common time series activities, such as testing for stationarity, extracting trends from noise, and performing lagged correlation.
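
To give a feel for one of these activities, here is a plain Python sketch of lagged correlation: the Pearson correlation between one series and a later-shifted copy of another. It is not the healthyR.ts implementation, and the function names and toy series are assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for a non-negative lag."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if lag < 0 or lag >= len(x) - 1:
        raise ValueError("lag must leave at least two overlapping points")
    return pearson(x[:len(x) - lag], y[lag:])


# Toy data: y repeats x two periods later.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0]
y = [0.0, 0.0] + x[:-2]

best_lag = max(range(4), key=lambda k: lagged_correlation(x, y, k))
print(best_lag)  # prints 2
```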

New Video: Online Passive-Aggressive Algorithms

I have a new video:

In this video, I cover the series of classification algorithms with the best possible name: online passive-aggressive algorithms.

I remember, when reading up on this, being incredulous that the idea even worked. But it turns out that it’s actually pretty good in practice, especially on constrained hardware. Still, this is definitely an algorithm you’d want to test in comparison to others before jumping right in, as there’s a risk you can end up with terrible results.
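
The name makes more sense once you see the update rule. Below is a minimal Python sketch of a single binary update in the PA-I variant: do nothing when the example is already classified with enough margin (passive), otherwise move the weights just far enough to fix it, capped by `C` (aggressive). The function name and toy values are assumptions, and this is not the code from the video.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a label y in {-1, +1}; returns the new weight list."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # passive: margin is already satisfied
    tau = min(C, loss / norm_sq)  # aggressive: step size capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1, C=10.0)
print(w)  # prints [1.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1, C=10.0)
print(w)  # prints [1.0, 0.0]
```

Each update needs only the current weights and one example, which is part of why the approach suits streaming data and constrained hardware. A single noisy example can still move the weights a long way when `C` is large, so it is worth tuning.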
