Tuning Apache Spark Applications

Vidisha Gupta has a few tips for tuning Apache Spark programs:

Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries –

  • Java Serialization: By default, spark uses Java’s ObjectOutputStream framework which can work with any class that implements java.io.serializable. This serialization is flexible but slow and creates large serialized formats for many classes.

  • Kryo Serialization: Spark can use Kryo library to serialize objects. It is much faster and compact but does not support all serializable types. So we must register those classes which we want to be serialized. Therefore, Kryo uses indices instead of full class names to identify data types which reduce the size of the serialized data thereby increasing performance. We can initialize our spark conf by setting the value of the property spark.serializer to org.apache.spark.serializer.KryoSerializer. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To know more about this serializer, refer  Kryo documentation

There are some good tips in here.

Related Posts

How .NET Code Talks to Spark

Ed Elliott has a great diagram showing how user-written .NET code communicates with Spark over the Java VM: 4. Spark-dotnet Java driver listens on tcp portThe spark-dotnet Java driver listens on a TCP socket. This socket is used to communicate between the Java VM and the dotnet code, the dotnet code doesn’t run in the […]

Read More

Cloudera and 100% Open Source Software

Alex Woodie notes a change at Cloudera: The old Cloudera developed and distributed its Hadoop stack using a mix of open source and proprietary methods and licenses. But the new Cloudera will be 100% open source, just like Hortonworks, its one-time Hadoop rival that it acquired in January. But will developing its data platform completely […]

Read More

Categories

November 2018
MTWTFSS
« Oct Dec »
 1234
567891011
12131415161718
19202122232425
2627282930