Press "Enter" to skip to content

Spark UDFs in Scala

Achilleus shows us how to create a user-defined function for Spark in Scala, as well as the performance drawbacks:

It is pretty straight forward and easy to create it in spark. Let’s say we have this customer data from Central Perk. If you look at the country data, it has a lot of discrepancies but we kinda know its the right country, it’s just that the way it is entered is not typical. Let’s say we need to normalize it to the USA that is similar with the help of a known dictionary.

The performance hit is often too much for me to accept, though that could just be that I write bad functions.