Probabilistic Record Linking In Spark

Tom Lous builds a solution to link similar companies together by address:

Recently a colleague asked me to help her with a data problem, that seemed very straightforward at a glance.
She had purchased a small set of data from the chamber of commerce (Kamer van Koophandel: KvK) that contained roughly 50k small sized companies (5–20FTE), which can be hard to find online.
She noticed that many of those companies share the same address, which makes sense, because a lot of those companies tend to cluster in business complexes.

Read on for the solution.  Like many data problems, it turns out to be a lot more complicated than you’d think at first glance.

Related Posts

The Basics Of PCA In R

Prashant Shekhar gives us an overview of Principal Component Analysis using R: PCA changes the axis towards the direction of maximum variance and then takes projection on this new axis. The direction of maximum variance is represented by Principal Components (PC1). There are multiple principal components depending on the number of dimensions (features) in the […]

Read More

Investigating The gcForest Algorithm

William Vorhies describes a new algorithm with strong potential: gcForest (multi-Grained Cascade Forest) is a decision tree ensemble approach in which the cascade structure of deep nets is retained but where the opaque edges and node neurons are replaced by groups of random forests paired with completely-random tree forests.  In this case, typically two of […]

Read More

Categories

April 2017
MTWTFSS
« Mar May »
 12
3456789
10111213141516
17181920212223
24252627282930