Stop Using word2vec

Chris Moody wants you to stop using word2vec:

When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries. Word vectors are awesome but you don’t need a neural network – and definitely don’t need deep learning – to find them. So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then stop using word2vec.

Chris has a follow-up post on word tensors as well:

There’s only three steps to computing word tensors. Counting word-word-document skipgrams, normalizing those counts to form the PMI-like M tensor and then factorizing M into smaller matrices.

But to actually perform the factorization we’ll need to generalize the SVD to higher rank tensors 1. Unfortunately, tensor algebra libraries aren’t very common 2. We’ve written one for non-negative sparse tensor factorization, but because the PMI can be both positive and negative it isn’t applicable here. Instead, for this application I’d recommend HOSVD as implemented in scikit-tensor. I’ve also heard good things about tensorly.

I’m going to keep using word2vec for now, but it’s a good pair of posts.

Related Posts

Conjoint Analysis In R

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R: We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and […]

Read More

Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates: This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031