Once you have identified and broken down the Spark components you want to monitor, along with the associated infrastructure and application components, you need to understand which metrics actually affect the performance of your application and your infrastructure. Let’s dig deeper into some of the things you should care about monitoring.
In Spark, memory-related issues are common if you haven’t paid attention to memory usage when building your application. Make sure you track garbage collection and memory usage on each component across the cluster, specifically the executors and the driver. Garbage collection stalls or abnormal garbage collection patterns can increase back pressure.
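As a rough illustration of tracking garbage collection per component, the sketch below reads executor summaries from Spark's monitoring REST API (`/api/v1/applications/[app-id]/executors`, which also lists the driver) and computes the share of task time spent in GC from the `totalGCTime` and `totalDuration` fields. The function names, the 10% threshold, and the example URL and application ID are assumptions made for illustration, not values from this article.

```python
import json
from urllib.request import urlopen


def fetch_executors(base_url, app_id):
    """Fetch executor summaries from the Spark REST API.

    The driver is included in the response with the id "driver".
    """
    url = f"{base_url}/api/v1/applications/{app_id}/executors"
    with urlopen(url) as resp:
        return json.load(resp)


def gc_report(executors, gc_threshold=0.10):
    """Summarize GC and storage-memory pressure per executor.

    gc_threshold is an assumed cutoff (10% of task time spent in GC);
    tune it for your own workload.
    """
    report = []
    for ex in executors:
        duration = ex.get("totalDuration", 0)  # total task time, in ms
        gc_time = ex.get("totalGCTime", 0)     # total JVM GC time, in ms
        max_mem = ex.get("maxMemory", 0)       # memory available for storage
        # Guard against division by zero: the driver usually runs no tasks.
        gc_fraction = gc_time / duration if duration else 0.0
        mem_fraction = ex.get("memoryUsed", 0) / max_mem if max_mem else 0.0
        report.append({
            "id": ex.get("id"),
            "gc_fraction": round(gc_fraction, 3),
            "storage_memory_fraction": round(mem_fraction, 3),
            "gc_heavy": gc_fraction > gc_threshold,
        })
    return report


# Example usage (hypothetical host and application ID):
# executors = fetch_executors("http://localhost:4040", "app-20240101000000-0000")
# for row in gc_report(executors):
#     print(row)
```

Because the driver typically reports no task time, its GC fraction comes out as zero in this sketch. Driver GC is better observed through JVM GC logs or the metrics system.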
A few metrics are worth noting here.