Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can’t filter out these noise variables is that without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of both what a good analysis and what a bad analysis looks like.
All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.
This does read like an academic paper, so it’s pretty heavy reading. It’s also very good reading from a great writer, so take some time and give it a read if you do data analysis.