This is the scenario: We have a job scheduler and a related job deployment manager, both implemented based on a state machines framework. One of the scheduler features is preemptable jobs: Jobs of that class can be suspended when a high-priority job needs to be scheduled and there is no available capacity. Effecting preemption requires some involved orchestration between the scheduler and the deployment manager, and we’ve had reliability issues in some cases – both due to incorrectly handled races and latency spikes in the cleanup of the suspended jobs from the cluster. Debugging such issues based on the raw logs has been very tedious – a typical log is 10-30K lines. This gets much worse with the number of dependencies. Given the concurrent processing of the suspensions, tracking the interactions with the new job’s deployment can be mentally taxing. The timeline visualization brought a breakthrough to our debugging ability and productivity. The following sample is a purposefully simplified case. In this scenario, things worked well. It shows the ‘Main’ job, at high priority, waiting on its dependencies to be suspended (while waiting, “Skipped schedule processing” is logged). Shortly after all the suspensions complete, the main job gets to Running state.
Read on to see the scenario in action.