Stop Using word2vec

Chris Moody wants you to stop using word2vec:

When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries. Word vectors are awesome but you don’t need a neural network – and definitely don’t need deep learning – to find them. So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then stop using word2vec.

Chris has a follow-up post on word tensors as well:

There’s only three steps to computing word tensors. Counting word-word-document skipgrams, normalizing those counts to form the PMI-like M tensor and then factorizing M into smaller matrices.

But to actually perform the factorization we’ll need to generalize the SVD to higher rank tensors 1. Unfortunately, tensor algebra libraries aren’t very common 2. We’ve written one for non-negative sparse tensor factorization, but because the PMI can be both positive and negative it isn’t applicable here. Instead, for this application I’d recommend HOSVD as implemented in scikit-tensor. I’ve also heard good things about tensorly.

I’m going to keep using word2vec for now, but it’s a good pair of posts.

Related Posts

Monte Carlo Simulation in Python

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions: One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how […]

Read More

An Example of p-Hacking

Vincent Granville explains why using p-values for model-worthiness can lead you to a bad outcome: Recently, p-values have been criticized and even banned by some journals, because they are used by researchers, who cherry-pick observations and repeat experiments until they obtain a p-value worth publishing to obtain grant money, get tenure, or for political reasons.  Even the […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031