Julia Silge explains Principal Component Analysis and shows us an example using Stack Overflow data:
We have tidy data, both because that’s what I get when querying our databases and because it is useful for exploratory data analysis when preparing for a machine learning algorithm like PCA. To implement PCA, we need a matrix, and in this case a sparse matrix makes most sense. Most developers do not visit most technologies so there are lots of zeroes in our matrix. The tidytext package has a function cast_sparse() that takes tidy data and casts it to a sparse matrix.

sparse_tag_matrix <- tag_percents %>% tidytext::cast_sparse(User, Tag, Value)
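To see the casting step in isolation, here is a minimal sketch that is not from the original post. It assumes a small, invented tag_percents data frame with the same three columns (User, Tag, Value); the users, tags, and values are made up purely for illustration.

```r
library(tidytext)

# Hypothetical stand-in for tag_percents: one row per (user, tag) pair.
# All values here are invented for illustration.
tag_percents <- data.frame(
  User  = c("u1", "u1", "u2", "u3", "u3"),
  Tag   = c("r", "python", "javascript", "r", "sql"),
  Value = c(0.7, 0.3, 1.0, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Rows become users, columns become tags, and any (user, tag) pair
# that is absent from the tidy data is stored as an implicit zero.
sparse_tag_matrix <- cast_sparse(tag_percents, User, Tag, Value)

dim(sparse_tag_matrix)             # users by tags
Matrix::nnzero(sparse_tag_matrix)  # number of explicitly stored entries
```

Only the five observed pairs are stored, which is the whole point when most developers never touch most tags.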
Several of the implementations for PCA in R are not sparse matrix aware, such as prcomp(); the first thing it will do is coerce the BEAUTIFUL SPARSE MATRIX you just made into a regular matrix, and then you will be sitting there for one zillion years with no RAM left. (That is a precise and accurate estimate from my benchmarking, obviously.) One option that does take advantage of sparse matrices is the irlba package.
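As a rough sketch of what the irlba route can look like, the example below runs prcomp_irlba() on a randomly generated sparse matrix. The matrix dimensions, density, seed, and number of components are assumptions chosen for illustration, not values from the original analysis.

```r
library(Matrix)
library(irlba)

set.seed(1234)

# Random sparse matrix standing in for the user-by-tag matrix
# (invented data: 200 "users", 50 "tags", about 5% non-zero entries).
m <- rsparsematrix(nrow = 200, ncol = 50, density = 0.05)

# Truncated PCA: only the first n components are computed, and the
# sparse input is not coerced to a dense matrix first.
tags_pca <- prcomp_irlba(m, n = 5, center = TRUE, scale. = FALSE)

dim(tags_pca$rotation)  # loadings: one row per column of m, one column per component
dim(tags_pca$x)         # scores: one row per row of m, one column per component
```

The result follows the same layout as a prcomp() fit (sdev, rotation, x), so downstream code that inspects loadings or scores should need little change.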
This is a great walkthrough of an important topic.