Pinku Swargiary shows us how to configure Spark to use Kryo serialization:
If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. Serialization has the most impact on join and grouping operations, because they usually involve shuffling data. The less data there is to shuffle, the faster the operation will be.
Caching is also affected when caching to disk or when data spills over from memory to disk. Also, if we look at the size metrics below for both Java and Kryo, we can see the difference.
Sounds like it’s better overall but requires some custom configuration.
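For a rough idea of what that custom configuration might look like, here is a minimal Scala sketch. It is not taken from the linked post: the `Person` case class, the app name, and the local master are hypothetical placeholders. The `spark.serializer` setting and `registerKryoClasses` are standard Spark configuration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only to illustrate class registration.
case class Person(name: String, age: Int)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-example") // placeholder name
      .setMaster("local[*]")      // assumption: local run for demonstration
      // Switch from the default Java serializer to Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write a short ID instead of the full class name.
      .registerKryoClasses(Array(classOf[Person]))

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // groupByKey forces a shuffle, so the Person records are serialized with Kryo.
    val people = spark.sparkContext.parallelize(
      Seq(Person("a", 30), Person("b", 30), Person("c", 41)))
    val countsByAge = people.groupBy(_.age).mapValues(_.size).collect()
    countsByAge.foreach(println)

    spark.stop()
  }
}
```

Registration is optional by default. Setting `spark.kryo.registrationRequired` to `true` makes Spark fail on unregistered classes, which can help catch types that would otherwise be written with their full class names.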