Manoj Pandey walks us through the key components in Apache Spark:
1. Spark Driver:
– The Driver program is the process that runs the Spark Application's main() function and orchestrates operations running in parallel on a Spark cluster.
– It is responsible for communicating with the Cluster Manager to allocate the resources needed to launch Spark Executors.
– It also instantiates the SparkSession for the Spark Application.
– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka the Spark execution plan). Each DAG is broken into Stages based on the operations to perform (with shuffle operations marking Stage boundaries), and each Stage is in turn divided into multiple Tasks, where each Task maps to a single partition of data.
– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
Click through for additional elements and how they fit together.