Spark UDFs in Scala

Achilleus shows us how to create a user-defined function for Spark in Scala, as well as the performance drawbacks:

It is pretty straight forward and easy to create it in spark. Let’s say we have this customer data from Central Perk. If you look at the country data, it has a lot of discrepancies but we kinda know its the right country, it’s just that the way it is entered is not typical. Let’s say we need to normalize it to the USA that is similar with the help of a known dictionary.

The performance hit is often too much for me to accept, though that could just be that I write bad functions.

Related Posts

How .NET Code Talks to Spark

Ed Elliott has a great diagram showing how user-written .NET code communicates with Spark over the Java VM: 4. Spark-dotnet Java driver listens on tcp portThe spark-dotnet Java driver listens on a TCP socket. This socket is used to communicate between the Java VM and the dotnet code, the dotnet code doesn’t run in the […]

Read More

Cloudera and 100% Open Source Software

Alex Woodie notes a change at Cloudera: The old Cloudera developed and distributed its Hadoop stack using a mix of open source and proprietary methods and licenses. But the new Cloudera will be 100% open source, just like Hortonworks, its one-time Hadoop rival that it acquired in January. But will developing its data platform completely […]

Read More

Categories

April 2019
MTWTFSS
« Mar May »
1234567
891011121314
15161718192021
22232425262728
2930