How Spark Determines Task Numbers and Parallelism

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks in a job and how many of them execute in parallel.

Let’s see how Spark decides on the number of tasks with the below set of instructions.

[… instructions]

Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
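The rule behind the explanation can be sketched in plain Python. This is a minimal illustration of the common sizing rule, not Spark's actual scheduler code, and the function names here are invented for the example: Spark launches one task per partition in a stage, and the number of tasks that run concurrently is capped by the total executor cores available.

```python
# Illustrative sketch (not Spark APIs): one task per partition,
# parallelism capped by total executor cores.

def tasks_for_stage(num_partitions: int) -> int:
    # Spark creates one task per partition of the stage's data.
    return num_partitions

def max_parallel_tasks(num_executors: int, cores_per_executor: int) -> int:
    # At most one task occupies a core at any given time.
    return num_executors * cores_per_executor

# Example: a dataset with 10 partitions on 2 executors x 3 cores each.
tasks = tasks_for_stage(10)        # 10 tasks in the stage
slots = max_parallel_tasks(2, 3)   # 6 tasks can run at once
waves = -(-tasks // slots)         # ceiling division: 2 waves of execution
print(tasks, slots, waves)         # 10 6 2
```

So with 10 partitions and 6 cores, 6 tasks run in the first wave and the remaining 4 in a second wave.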

Click through for the full explanation.