Levenshtein Distances

Peter Coates provides an extremely fast estimate of Levenshtein Distance:

If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:

  • Confirming suspected near-duplication.

  • Estimating how much two document vary.

  • Filtering through large numbers of documents to look for a near-match to some substantial block of text.

The estimation process is pretty interesting.  Worth a read.

Related Posts

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way: Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, […]

Read More

Classifying Texts With Naive Bayes

I continue my series on Naive Bayes with another hand-calculation post: Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930