How Spark Determines Task Numbers and Parallelism

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks in a job and how many of those tasks execute in parallel.

Let’s see how Spark decides on the number of tasks using the set of instructions below.

[… instructions]

Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
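
As a rough illustration of how partition counts translate into task counts, here is a minimal Python sketch for two datasets with 10 and 5 partitions. It relies on the general Spark behavior that a stage runs one task per partition and that the number of concurrent tasks is capped by the available executor cores (divided by `spark.task.cpus`). The cluster size (2 executors with 2 cores each), the dataset labels, and the helper function names are assumptions made for illustration; they do not come from the original post.

```python
import math


def tasks_for_stage(num_partitions: int) -> int:
    """A stage runs one task per partition of the data it processes."""
    return num_partitions


def max_parallel_tasks(num_executors: int, cores_per_executor: int,
                       cpus_per_task: int = 1) -> int:
    """Upper bound on tasks running at once: task slots summed over executors.

    cpus_per_task mirrors the spark.task.cpus setting (default 1).
    """
    return num_executors * (cores_per_executor // cpus_per_task)


def waves(num_tasks: int, slots: int) -> int:
    """How many rounds of scheduling are needed to run every task."""
    return math.ceil(num_tasks / slots)


# Assumed cluster: 2 executors x 2 cores (hypothetical values).
slots = max_parallel_tasks(num_executors=2, cores_per_executor=2)

# Partition counts taken from the example above; labels are illustrative.
partitions = {"10-partition dataset": 10, "5-partition dataset": 5}

for label, num_partitions in partitions.items():
    tasks = tasks_for_stage(num_partitions)
    print(f"{label}: {tasks} tasks, up to {min(tasks, slots)} at a time, "
          f"{waves(tasks, slots)} wave(s)")
```

Under these assumed settings there are 4 task slots, so the 10 tasks of the larger dataset need 3 waves and the 5 tasks of the smaller one need 2. Changing the executor or core counts changes the parallelism but not the number of tasks.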

