John Mount has a nice essay on overfitting:
What is meant by “overfitting” is: the estimated f() will tend to show off or over perform on the data used to fit, train, or construct it. I have some notes on this sort of selection bias here: https://win-vector.com/2020/12/10/overfit-and-reversion-to-mediocrity-the-bane-of-data-science/. Selecting a model that “looks good” is enough to bias the model’s evaluation with respect to the data set we said it “looked good” on. So even when using unbiased methods, the data scientist can introduce bias by choosing to use one model (say the one fit by logistic regression) over another (say using an observed prevalence everywhere as a probability prediction).
The way I talk about overfitting is to say that we’ve trained a model which latches onto the particulars of the training data set. To the extent that the particulars of the training data set are matched by the broader world, that’s “fitting.” To the extent that the particulars of the training data set are unique to that data set and are not generally applicable, that’s “overfitting.” I generally don’t have time to get into this in more detail, but John dives into the topic in an accessible way.
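And here’s a minimal sketch of the “latching onto particulars” idea (again my own toy setup, not John’s): a flexible polynomial tracks the training points closely, but its error on a new draw from the same process is typically much larger, while a simpler fit performs about as well on new data as on the data it was fit to.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=20):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.5, size=n)  # signal plus noise
    return x, y

x_train, y_train = simulate()
x_new, y_new = simulate()

def rmse(coefs, x, y):
    return float(np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2)))

# Higher-degree polynomials fit the training points more closely ("fitting"
# plus "overfitting"); the gap between train and new-data error is the part
# that was specific to the training set.
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree}: train RMSE {rmse(coefs, x_train, y_train):.2f}, "
          f"new-data RMSE {rmse(coefs, x_new, y_new):.2f}")
```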