Spark RDD to Task mapping

Understanding how RDD are converted to Task

RDD is the basic unit of spark jobs and stands for resilient distributed data set. RDD has following attributes

Distributed           :  RDD is distributed in nature ie it spans across multiple nodes to run a Task.
Resilient               :  In case of any failures of any task, RDD takes care of re-running it.
Transformation     :  A chain of funcation can be applied to RDD which internally will be executed in a distributed way.
Lazy Evaluation   :  Until a action is needed RDD will never be executed.
Parallel factor       :  RDD inherits the parallelism from its parents, can also be overridden if setParallelism() is called.



Key Take Aways:

  • Based on the inputFormat the splits are decided. Spark use hadoop InputFormats.
  • The parallel factor for the first Stage is based on the input formats.
  • A shuffle results in redistribution of data.
  • The parallel factor after the shuffle(Num Reduce Tasks) is still based on its parent RDD(Num Map Tasks).
  • Use repartition() or setParellelism to control the number of reduce task spawned.
  • Shuffle results in creation of stages.
  • All functions applied on RDD(Transformations) will be chained in one stage.
  • Stage => group of Transformations(functions). A stage will comprise of multiple Tasks.
  • Each Task in a stage is spawned on a Executor as a Thread.
  • A executor can spawn multiple Task.
  • Task in Spark is a thread and not a process.
  • Task can be map or reduce task.
  • Tasks run on executors and hence share the same JVM , any static reference will be shared across all the tasks.



End to End flow:

  • Driver take your Java code.
  • Driver analyses which all actions need to be performed and combines the transformations in the path of action.
  • Driver generates a optimized work-flow of Tasks.
  • Task are grouped together in stages.
  • Task resulting into shuffle decides the boundary of stages.
  • Tasks are instantiated on driver , serialized and sent to executor.
  • The initial parallel factor for a stage is decided by the input Format used.