Improving Apache Flink Scheduler Performance

Zhilong Hong, et al., share some interesting results from Apache Flink 1.14. Part one sets the scene:

To estimate the effect of our optimizations, we conducted several experiments to compare the performance of Flink 1.12 (before the optimization) with Flink 1.14 (after the optimization). The job in our experiments contains two vertices connected with an all-to-all edge. The parallelisms of these vertices are both 10K. To make temporary deployment descriptors distributed via the blob server, we set the configuration blob.offload.minsize to 100 KiB (from default value 1 MiB). This configuration means that the blobs larger than the set value will be distributed via the blob server, and the size of deployment descriptors in our test job is about 270 KiB. The results of our experiments are illustrated below:
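To make the threshold change concrete, here is a minimal sketch of the size comparison described above. The helper name `is_offloaded` is hypothetical and is not part of Flink's API. It only models the stated rule that blobs larger than the configured value go through the blob server.

```python
KIB = 1024
MIB = 1024 * KIB


def is_offloaded(blob_size_bytes: int, min_size_bytes: int) -> bool:
    """Hypothetical helper: True when a blob is larger than the offload threshold."""
    return blob_size_bytes > min_size_bytes


# Sizes quoted in the excerpt.
descriptor_size = 270 * KIB   # approximate deployment descriptor size in the test job
default_threshold = 1 * MIB   # default blob.offload.minsize
test_threshold = 100 * KIB    # value used in the experiments

print("offloaded with default threshold:", is_offloaded(descriptor_size, default_threshold))
print("offloaded with test threshold:", is_offloaded(descriptor_size, test_threshold))
```

Under the default of 1 MiB, a 270 KiB descriptor would stay below the threshold, which is why the authors lowered the setting to exercise the blob server path.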

Part two explains their improvements:

In Flink 1.12, the ExecutionEdge class is used to store the information of connections between tasks. This means that for the all-to-all distribution pattern, there would be O(n²) ExecutionEdges, which would take up a lot of memory for large-scale jobs. For two JobVertices connected with an all-to-all edge and a parallelism of 10K, it would take more than 4 GiB memory to store 100M ExecutionEdges. Since there can be multiple all-to-all connections between vertices in production jobs, the amount of memory required would increase rapidly.

As we can see in Fig. 1, for two JobVertices connected with the all-to-all distribution pattern, all IntermediateResultPartitions produced by upstream ExecutionVertices are isomorphic, which means that the downstream ExecutionVertices they connect to are exactly the same. The downstream ExecutionVertices belonging to the same JobVertex are also isomorphic, as the upstream IntermediateResultPartitions they connect to are the same too. Since every JobEdge has exactly one distribution type, we can divide vertices and result partitions into groups according to the distribution type of the JobEdge.
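
The grouping idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Flink's implementation: the function names and the use of tuples and dicts are hypothetical. It compares the number of per-pair edges with the number of references needed when every partition points to one shared consumer group and every vertex points to one shared partition group.

```python
GIB = 1024 ** 3


def per_pair_edge_count(n_upstream: int, n_downstream: int) -> int:
    """Edges needed when every (partition, consumer) pair gets its own object."""
    return n_upstream * n_downstream


def build_groups(n_upstream: int, n_downstream: int):
    """Hypothetical grouped layout for a single all-to-all edge.

    All upstream partitions share one consumer group and all downstream
    vertices share one partition group, so each entry is a reference to
    the same tuple rather than a private list.
    """
    partition_group = tuple(range(n_upstream))   # partitions consumed by every vertex
    vertex_group = tuple(range(n_downstream))    # vertices consuming every partition
    consumers_of = {p: vertex_group for p in partition_group}
    inputs_of = {v: partition_group for v in vertex_group}
    return consumers_of, inputs_of


parallelism = 10_000
edges = per_pair_edge_count(parallelism, parallelism)
consumers_of, inputs_of = build_groups(parallelism, parallelism)

# Grouped layout: one reference per partition and per vertex, plus the two shared groups.
grouped_refs = len(consumers_of) + len(inputs_of)

print("per-pair edges:", edges)
print("grouped references:", grouped_refs)
print("bytes per edge implied by 4 GiB:", 4 * GIB / edges)
```

With a parallelism of 10K on both sides, the per-pair layout needs 100M edge objects, while the grouped layout needs on the order of 20K references plus two shared groups. The 4 GiB figure from the excerpt works out to roughly 43 bytes per edge at minimum.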

Click through for a dive into the architecture.
