
Plotting The Effects of Noise on R^2

Tomaz Kastrun messes with R^2:

So, an R-squared of 0.59 might show how well the data fit the model (hence goodness of fit) and also explains about 59% of the variation in our dependent variable.

Given this logic, we prefer our regression models to have a high R-squared. Yes? Right! And with a simple test, adding random noise to a function, what happens next?

I like Tomaz’s scenario here and think he does a good job demonstrating the outcome. I do, however, struggle with the characterization of “making R^2 useless.” When the error term grows enormous relative to the regressable components, that R^2 is telling you that something else is dominating the relationship between the independent variables and the dependent variable. And this is correct: that error term does dominate. I suppose the problem here is philosophical: we call it an error term, but what it signifies is “information we don’t understand about the relationship between these variables.” Yes, in this toy example, it was randomly generated noise. But in a real dataset, it’s not random; it’s inexplicable, at least given the information you know at the time and the mechanisms you use to analyze the relationship.
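To see the point concretely, here is a minimal sketch (in Python rather than Tomaz's original R, and with made-up parameters) of the same idea: fit a simple linear regression on a known linear signal while the noise standard deviation grows, and watch R^2 collapse as the error term comes to dominate.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
signal = 3 * x + 5  # the component the regression can actually explain

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

results = {}
for noise_sd in [1, 10, 100]:
    # Add increasingly dominant random noise to the linear signal
    y = signal + rng.normal(0, noise_sd, size=x.shape)
    # Ordinary least squares fit of y on x
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    results[noise_sd] = r_squared(y, y_hat)
    print(f"noise sd = {noise_sd:>3}: R^2 = {results[noise_sd]:.3f}")
```

With a small noise standard deviation, R^2 sits near 1; once the noise dwarfs the signal's own spread (here roughly 8.7), R^2 falls toward 0. The low R^2 isn't lying; it's reporting, correctly, that the model explains little of the variation.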