John Mount shares notes on a theme:
One of the bigger risks of iterative statistical or machine learning fitting procedures is over-fit or the dreaded data leak.
Over-fit is when a model performs better on training data than on future data. Some degree of over-fit is expected. A data leak is when the model learns things about the evaluation set that it would not know about the future data the model will be applied to. This can drive models that look great on training and (supposedly) held-out data, but don’t work in practice.
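To make the distinction concrete, here is a minimal sketch (not from Mount's post) of one common data leak: doing feature selection on the full data set, labels included, before cross-validation. On pure noise data the leaky version looks impressive while the honest version hovers near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2023)
X = rng.normal(size=(100, 1000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: feature selection sees every row (including the rows later used for
# evaluation), so the "best" columns are the ones that correlate with y by chance.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Honest: selection happens inside each training fold only, via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_score:.2f}")   # typically well above 0.5
print(f"honest CV accuracy: {honest_score:.2f}")  # near chance, as it should be
```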
Click through for the rest of the story, and be sure to check out the comments for a notebook digging further into one of the topics.