Julia Silge explains Principal Component Analysis and shows us an example using Stack Overflow data:

We have tidy data, both because that’s what I get when querying our databases and because it is useful for exploratory data analysis when preparing for a machine learning algorithm like PCA. To implement PCA, we need a matrix, and in this case a sparse matrix makes most sense. Most developers do not visit most technologies so there are lots of zeroes in our matrix. The tidytext package has a function `cast_sparse()` that takes tidy data and casts it to a sparse matrix.

`sparse_tag_matrix <- tag_percents %>% tidytext::cast_sparse(User, Tag, Value)`
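
To make the cast concrete, here is a minimal sketch on made-up data. The rows of `tag_percents` below are invented for illustration; only the column names `User`, `Tag`, and `Value` come from the excerpt.

```r
library(tidytext)

# Hypothetical toy data: one row per (user, tag) pair that actually occurred.
tag_percents <- data.frame(
  User  = c("u1", "u1", "u2", "u3"),
  Tag   = c("r", "python", "r", "sql"),
  Value = c(0.7, 0.3, 1.0, 1.0),
  stringsAsFactors = FALSE
)

# Rows become users and columns become tags.
# Pairs that never occur are implicit zeroes and take up no storage.
sparse_tag_matrix <- cast_sparse(tag_percents, User, Tag, Value)

dim(sparse_tag_matrix)             # 3 users by 3 tags
Matrix::nnzero(sparse_tag_matrix)  # 4 stored values out of 9 cells
```

Only the four observed pairs are stored, which is what keeps memory use manageable when most developers never touch most tags.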

Several of the implementations for PCA in R are not sparse matrix aware, such as `prcomp()`; the first thing it will do is coerce the BEAUTIFUL SPARSE MATRIX you just made into a regular matrix, and then you will be sitting there for one zillion years with no RAM left. (That is a precise and accurate estimate from my benchmarking, obviously.) One option that does take advantage of sparse matrices is the irlba package.
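
The excerpt stops before showing a call into irlba. One way this might look is sketched below, assuming the package's `prcomp_irlba()` function and a randomly generated stand-in matrix; the dimensions, density, and number of components are arbitrary choices for illustration, not values from the post.

```r
library(Matrix)
library(irlba)

set.seed(42)

# Hypothetical stand-in for the user-by-tag matrix: 200 x 50, about 5% non-zero.
sparse_tag_matrix <- rsparsematrix(nrow = 200, ncol = 50, density = 0.05)

# Truncated PCA: compute only the first 5 principal components.
# The sparse matrix is passed directly, with no as.matrix() call on our side.
tags_pca <- prcomp_irlba(sparse_tag_matrix, n = 5)

dim(tags_pca$rotation)  # 50 variables by 5 components
dim(tags_pca$x)         # 200 observations by 5 components
```

The result carries the familiar `prcomp` pieces (`sdev`, `rotation`, `x`), so downstream code written for `prcomp()` output should mostly carry over, with the caveat that only the requested number of components is available.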

This is a great walkthrough of an important topic.

Kevin Feasel

2018-05-22

Data Science, R