Levenshtein Distances

Peter Coates provides an extremely fast estimate of Levenshtein Distance:

If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:

  • Confirming suspected near-duplication.

  • Estimating how much two document vary.

  • Filtering through large numbers of documents to look for a near-match to some substantial block of text.

The estimation process is pretty interesting.  Worth a read.

Related Posts

Biases in Tree-Based Models

Nina Zumel looks at tree-based ensembling models like random forest and gradient boost and shows that they can be biased: In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data […]

Read More

Comparing Poisson Regression to Regressing Against Logs

Nina Zumel compares a pair of methods for performing regression when income is the dependent variable: Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930