Principal Component Analysis Using R

Kevin Feasel



Nina Zumel delves into principal component regression using R (via R Bloggers):

Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can’t filter out these noise variables is that without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of both what a good analysis and what a bad analysis looks like.

All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.

This does read like an academic paper, so it’s pretty heavy reading. It’s also very good reading from a great writer, so take some time and give it a read if you do data analysis.
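
To make the quoted point a bit more concrete, here is a minimal sketch in base R. It is not the code from Nina's article: the data is simulated, the variable names (`signal1`, `noise1`, and so on) are made up for illustration, and the y-aware step assumes the usual reading of the term, where each column is rescaled by the slope of a one-variable regression of `y` on that column.

```r
# Illustrative sketch only: simulated data, hypothetical variable names.
set.seed(2016)
n <- 1000

# Two variables that actually drive y, on a small scale.
signal1 <- rnorm(n)
signal2 <- rnorm(n)
y <- signal1 + signal2 + rnorm(n, sd = 0.1)

# Two variables unrelated to y, deliberately on a much larger scale.
noise1 <- rnorm(n, sd = 100)
noise2 <- rnorm(n, sd = 100)

x <- data.frame(signal1, signal2, noise1, noise2)

# Standard (x-only) PCA on the mis-scaled data: the first component
# follows whichever columns have the most variance, not the ones tied to y.
pca_raw <- prcomp(x, center = TRUE, scale. = FALSE)
top_raw <- names(which.max(abs(pca_raw$rotation[, 1])))

# y-aware scaling (assumed form): center each column, then multiply it by
# the slope from lm(y ~ column), so one unit of the rescaled column
# corresponds to one unit of change in y.
slopes <- vapply(x, function(col) unname(coef(lm(y ~ col))[2]), numeric(1))
x_centered <- scale(x, center = TRUE, scale = FALSE)
x_scaled <- sweep(x_centered, 2, slopes, `*`)

pca_y <- prcomp(x_scaled, center = FALSE, scale. = FALSE)
top_y <- names(which.max(abs(pca_y$rotation[, 1])))

print(top_raw)  # a noise column dominates the first component
print(top_y)    # a signal column dominates after y-aware scaling
```

With these settings, the first principal component of the raw data loads most heavily on one of the noise columns, while after the rescaling it loads most heavily on one of the signal columns. The real article works through this far more carefully, so treat the sketch as orientation rather than a substitute.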


