Principal Component Analysis With Stack Overflow Data

Julia Silge explains Principal Component Analysis and shows us an example using Stack Overflow data:

We have tidy data, both because that’s what I get when querying our databases and because it is useful for exploratory data analysis when preparing for a machine learning algorithm like PCA. To implement PCA, we need a matrix, and in this case a sparse matrix makes most sense. Most developers do not visit most technologies so there are lots of zeroes in our matrix. The tidytext package has a function cast_sparse() that takes tidy data and casts it to a sparse matrix.

sparse_tag_matrix <- tag_percents %>% tidytext::cast_sparse(User, Tag, Value)

Several of the implementations for PCA in R are not sparse matrix aware, such as prcomp(); the first thing it will do is coerce the BEAUTIFUL SPARSE MATRIX you just made into a regular matrix, and then you will be sitting there for one zillion years with no RAM left. (That is a precise and accurate estimate from my benchmarking, obviously.) One option that does take advantage of sparse matrices is the irlba package.

This is a great walkthrough of an important topic.

Related Posts

Interactive ggplot Plots with plotly

Laura Ellis takes us through ggplotly: As someone very interested in storytelling, ggplot2 is easily my data visualization tool of choice. It is like the Swiss army knife for data visualization. One of my favorite features is the ability to pack a graph chock-full of dimensions. This ability is incredibly handy during the data exploration […]

Read More

Goodbye, gather and spread; Hello pivot_long and pivot_wide

John Mount covers a change in tidyr which mimics Mount and Nina Zumel’s pivot_to_rowrecs and unpivot_to_blocks functions in the cdata package: If you want to work in the above way we suggest giving our cdatapackage a try. We named the functions pivot_to_rowrecs and unpivot_to_blocks. The idea was: by emphasizing the record structure one might eventually internalize what the transforms […]

Read More

Categories

May 2018
MTWTFSS
« Apr Jun »
 123456
78910111213
14151617181920
21222324252627
28293031