Monitoring Apache Spark

Swaroop Ramachandra has started a series on monitoring Apache Spark:

Spark provides metrics for each of the above components through different endpoints. For example, if you want to look at the Spark driver details, you need to know the exact URL, which keeps changing over time–Spark keeps you guessing on the URL. The typical problem is when you start your driver in cluster mode. How do you detect on which worker node the driver was started? Once there, how do you identify the port on which the Spark driver exposes its UI? This seems to be a common annoying issue for most developers and DevOps professionals who are managing Spark clusters. In fact, most end up running their driver in client mode as a workaround, so they have a fixed URL endpoint to look at. However, this is being done at the cost of losing failover protection for the driver. Your monitoring solution should be automatically able to figure out where the driver for your application is running, find out the port for the application and automatically configure itself to start collecting metrics.

For a dynamic infrastructure like Spark, your cluster can get resized on the fly. You must ensure your newly spawned components (Workers, executors) are automatically configured for monitoring. There is no room for manual intervention here. You shouldn’t miss out monitoring newer processes that show up on the cluster. On the other hand, you shouldn’t be generating false alerts when executors get moved around. A general monitoring solution will typically start alerting you if an executor gets killed and starts up on a new worker–this is because generic monitoring solutions just monitor your port to check if it’s up or down. With a real time streaming system like Spark, the core idea is that things can move around all the time.

Spark does add a bit of complexity to monitoring, but there are solutions in place.  Read the whole thing.

Related Posts

Changing Int To Bigint

Danny Kruge shows one way to change a table’s identity value from integer to bigint: The table was around 500GB with over 900 million rows. Based on the average number of inserts a day on that table, I estimated that we had eight months before inserts on that table would grind to a halt. This […]

Read More

Changing The Default Filegroup

Kenneth Fisher shows how you can change the default filegroup: You know you can have multiple filegroups right? You might have a separate filegroup for the data (the clustered index & heaps) and another for the indexes (non-clustered indexes). Or maybe you want to separate your data tables from the system tables. There are any […]

Read More


June 2016
« May Jul »