It is pretty straight forward and easy to create it in spark. Let’s say we have this customer data from Central Perk. If you look at the country data, it has a lot of discrepancies but we kinda know its the right country, it’s just that the way it is entered is not typical. Let’s say we need to normalize it to the
USA
that is similar with the help of a known dictionary.
The performance hit is often too much for me to accept, though that could just be that I write bad functions.
Comments closed