Principal Component Analysis With Faces

2018-08-14

PCA looks for a new reference system to describe your data. This new reference system is designed to maximize the variance of the data along the new axes. The first principal component accounts for as much variance as possible, the second accounts for as much of the remaining variance as possible, and so on. PCA transforms a set of (typically) correlated variables into a set of uncorrelated variables called principal components (PCs). The hope is that a small number of PCs can be used to summarise the whole dataset. Note that each PC is a linear combination of the original variables.

The procedure boils down to the following steps:

1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

2. Calculate the covariance matrix of the data.

3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

4. Keep only the N largest eigenvalues, according to one of the many criteria available in the literature.

5. Project your data into the new frame of reference by multiplying your data matrix by a matrix whose columns are the N eigenvectors associated with the N largest eigenvalues.

6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
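
The six steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the code from the linked post: the data matrix `X` is synthetic, and the 90% cumulative-variance cutoff used to pick N is just one of the many possible criteria.

```r
# Synthetic data, for illustration only: 100 observations of 3 variables,
# two of which (x1, x2) are strongly correlated.
set.seed(42)
x1 <- rnorm(100)
X <- cbind(x1 = x1,
           x2 = 2 * x1 + rnorm(100, sd = 0.5),
           x3 = rnorm(100))

# 1. Scale the data: center each column and divide by its standard deviation.
Z <- scale(X)

# 2. Covariance matrix of the scaled data (equal to the correlation matrix of X).
S <- cov(Z)

# 3. Eigenvectors ("loadings") and eigenvalues; eigen() returns the
#    eigenvalues in decreasing order.
e <- eigen(S, symmetric = TRUE)
loadings <- e$vectors
eigenvalues <- e$values

# 4. Keep the N largest eigenvalues. Assumed criterion: the smallest N whose
#    components explain at least 90% of the total variance.
explained <- cumsum(eigenvalues) / sum(eigenvalues)
N <- which(explained >= 0.90)[1]

# 5. Project the data onto the first N eigenvectors.
scores <- Z %*% loadings[, 1:N, drop = FALSE]

# 6. The scores are the new variables: they are uncorrelated, and the
#    variance of each one equals the corresponding eigenvalue.
print(round(cov(scores), 10))
```

Up to the sign of each column, `scores` should match the first N columns of `prcomp(X, scale. = TRUE)$x`, which is a handy sanity check.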

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA. H/T R-bloggers

Linear Regression Assumptions

2019-06-17

Stephanie Glen has a chart which explains the four key assumptions under which Ordinary Least Squares is the Best Linear Unbiased Estimator: If any of the main assumptions of linear regression are violated, any results or forecasts that you glean from your data will be extremely biased, inefficient or misleading. Navigating all of the different assumptions […]

Visualizing with Heatmaps in R

2019-06-17

Anisa Dhana shows how you can create a quick heatmap plot in R: To give your own colors use the scale_fill_gradientn function.

    ggplot(dat, aes(Age, Race)) + geom_raster(aes(fill = BMI)) + scale_fill_gradientn(colours=c("white", "red"))

This is a quick example using ggplot2 but there are other heatmap libraries available too.
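
The quoted call expects a data frame `dat` with `Age`, `Race` and `BMI` columns, which the excerpt does not show. A self-contained sketch might look like the following, assuming ggplot2 is installed; the data frame is filled with made-up values and the group labels are hypothetical, so only the plotting call reflects the original.

```r
library(ggplot2)

# Made-up data, for illustration only: one BMI value per Age/Race combination.
set.seed(1)
dat <- expand.grid(Age = 20:60, Race = c("Group A", "Group B", "Group C"))
dat$BMI <- runif(nrow(dat), min = 18, max = 35)

# Heatmap with a custom white-to-red fill gradient, as in the quoted call.
p <- ggplot(dat, aes(Age, Race)) +
  geom_raster(aes(fill = BMI)) +
  scale_fill_gradientn(colours = c("white", "red"))

print(p)
```

Passing more than two colours to `colours` gives a multi-stop gradient, for example `c("white", "orange", "red")`.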