Handling Definitional Changes In Predictive Variables

Vincent Granville explains how you can blend two different definitions of a variable of interest together:

The reason scores can become meaningless over time is that data evolves. New features (variables) are added that were not available before, or the definition of a metric suddenly changes (for instance, the way income is measured), so that new data is not compatible with prior data and scores become faulty. Also, when external data is gathered from multiple sources, each source may compute it differently, resulting in incompatibilities: for instance, when comparing the credit scores of two people who are customers at two different banks, each bank computes the base metrics (income, recency, net worth, and so on) used to build the score in a different way. Sometimes the issue is caused by missing data, especially when users with missing data are very different from those with full data attached to them.

Click through for a description of the approach and links showing how it works in practice.

Related Posts

Lasso and Ridge Regression in Python

Kristian Larsen shows off a few regression techniques using Python: Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients are most strongly associated with the response variable. Therefore, when you conduct a regression model it can be helpful to do a […]

Read More

Using Cohen’s D for Experiments

Nina Zumel takes us through Cohen’s D, a useful tool for determining effect sizes in experiments: Cohen’s d is a measure of effect size for the difference of two means that takes the variance of the population into account. It’s defined as d = |μ1 - μ2| / σpooled, where σpooled is the pooled standard deviation over both cohorts. […]
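To make the definition concrete, here is a minimal Python sketch of Cohen's d. The function name `cohens_d` and the sample values are illustrative only, and the excerpt does not say how the pooled standard deviation is computed; this sketch assumes the common form that weights each cohort's sample variance by its degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Effect size for the difference of two means (hypothetical helper).

    Assumes the pooled variance is the degrees-of-freedom-weighted
    average of the two sample variances; other pooling conventions exist.
    """
    n1, n2 = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return abs(mean(a) - mean(b)) / sqrt(pooled_var)


# Made-up cohorts, purely for illustration.
control = [2.0, 4.0, 6.0]
treatment = [5.0, 7.0, 9.0]
print(cohens_d(control, treatment))  # 1.5
```

Both made-up cohorts have a sample variance of 4, so the pooled standard deviation is 2 and the means differ by 3, which gives d = 1.5.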

Read More


February 2019