Vidisha Gupta has a few tips for tuning Apache Spark programs:
Data Serialization – Serialization plays an important role in the performance of any distributed application. Spark provides two serialization libraries:
- Java Serialization: By default, Spark uses Java's ObjectOutputStream framework, which can work with any class that implements java.io.Serializable. This serialization is flexible, but it is slow and produces large serialized formats for many classes.
- Kryo Serialization: Spark can use the Kryo library to serialize objects. It is much faster and more compact, but it does not support all serializable types, so we must register the classes we want serialized. Kryo then uses small indices instead of full class names to identify data types, which reduces the size of the serialized data and improves performance. We enable it by setting the property spark.serializer to org.apache.spark.serializer.KryoSerializer on our SparkConf (a minimal sketch follows this list). This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To learn more about it, refer to the Kryo documentation.
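Here is a minimal Scala sketch of how the Kryo setup described above might look. The class MyRecord and the application name are hypothetical placeholders; the config keys and the registerKryoClasses call are part of Spark's public API.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical application class that will be shuffled or cached.
case class MyRecord(id: Long, payload: Array[Byte])

object KryoTuningExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-tuning-example")
      // Switch from the default Java serializer to Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register classes so Kryo writes a small index instead of the full class name.
      .registerKryoClasses(Array(classOf[MyRecord]))

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... shuffles and caches involving MyRecord now go through Kryo ...
    spark.stop()
  }
}
```

Optionally, setting spark.kryo.registrationRequired to true makes Spark fail fast whenever an unregistered class is serialized, which helps catch classes that would otherwise fall back to writing full class names.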
There are some good tips in here.