Probabilistic Record Linking In Spark

Tom Lous builds a solution to link similar companies together by address:

Recently a colleague asked me to help her with a data problem, that seemed very straightforward at a glance.
She had purchased a small set of data from the chamber of commerce (Kamer van Koophandel: KvK) that contained roughly 50k small sized companies (5–20FTE), which can be hard to find online.
She noticed that many of those companies share the same address, which makes sense, because a lot of those companies tend to cluster in business complexes.

Read on for the solution.  Like many data problems, it turns out to be a lot more complicated than you’d think at first glance.

Related Posts

Neural Nets On Spark

Nisha Muktewar and Seth Hendrickson show how to use Deeplearning4j to build deep learning models on Hadoop and Spark: Modern convolutional networks can have several hundred million parameters. One of the top-performing neural networks in the Large Scale Visual Recognition Challenge (also known as “ImageNet”), has 140 million parameters to train! These networks not only […]

Read More

Linear Prediction Confidence Region Flare-Out

John Cook explains why the confidence region of a tracked object flares out instead of looking conical (or some other shape): Suppose you’re tracking some object based on its initial position x0 and initial velocity v0. The initial position and initial velocity are estimated from normal distributions with standard deviations σx and σv. (To keep […]

Read More

Categories

April 2017
MTWTFSS
« Mar May »
 12
3456789
10111213141516
17181920212223
24252627282930