Karunanithi Shanmugam gives us some tips on memory management for Spark in Amazon's Elastic MapReduce:
Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.
To use all the resources available in a cluster, set the maximizeResourceAllocation parameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the spark-defaults settings. Even with this setting, the default numbers are generally low and the application doesn't use the full strength of the cluster. For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.

Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. With Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).
There’s a lot in here, much of which applies to Spark in general and not just EMR.
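If the defaults do turn out to be too conservative, the simplest fix is to override them yourself. Here's a minimal sketch of what that might look like at session start; the partition count of 200 is a placeholder, not a recommendation (a common starting point is two to three tasks per vCore across your executors), and the check at the end just confirms what actually took effect:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelism-sketch")
      // spark.default.parallelism governs RDD operations; the EMR default
      // of 2 x vCores is often too low for wide shuffles on a big cluster.
      // 200 here is a placeholder value; tune it to your cluster.
      .config("spark.default.parallelism", "200")
      // DataFrame/SQL shuffles use a separate setting; keep it in step.
      .config("spark.sql.shuffle.partitions", "200")
      .getOrCreate()

    // Confirm the effective settings, including dynamic allocation,
    // which EMR enables by default from release 4.4.0 onward.
    println(spark.conf.get("spark.default.parallelism"))
    println(spark.conf.getOption("spark.dynamicAllocation.enabled").getOrElse("unset"))

    spark.stop()
  }
}
```

Note that maximizeResourceAllocation itself is set through the EMR spark configuration classification when you create the cluster, not from inside the application.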