GroupByKey vs ReduceByKey

What's the difference between groupByKey and reduceByKey in Spark?

GroupByKey

 

 

 

 

ReduceByKey / CombineByKey / AggregateByKey:

 

 

 

 


GroupByKey | ReduceByKey / CombineByKey / AggregateByKey
-----------|---------------------------------------------
All data is sent from the map task to the reduce task. | A combiner runs on the map task and again on the reduce task.
No optimization of network I/O. | Network I/O is optimized.
Should be used only if all values of a given key are needed in the reduce task. | Should be preferred whenever the result can be built by combining values; avoid groupByKey in those cases. Suited to functions such as sum, average, median, mode, and top N.
Can lead to GC problems and job failure. | Less data is shuffled, so the chance of job failure is lower.
One Spark partition can hold a maximum of 2 GB of data. | One Spark partition can hold a maximum of 2 GB of data.
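
The shuffle difference in the comparison above can be illustrated with a small sketch. This is plain Python, not Spark: the two lists in `partitions` are hypothetical stand-ins for map-side partitions, and `group_by_key` and `reduce_by_key` are illustrative helpers that only mimic how many records would cross the network in each approach.

```python
from collections import defaultdict

# Hypothetical input: two map-side partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("b", 7)],
]


def group_by_key(parts):
    """Mimic groupByKey: every record crosses the shuffle unchanged."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key(parts, func):
    """Mimic reduceByKey: combine inside each partition, then merge."""
    shuffled = []
    for part in parts:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        # After the map-side combine, at most one record per key leaves a partition.
        shuffled.extend(combined.items())
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)


grouped, grouped_shuffled = group_by_key(partitions)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
sums, reduced_shuffled = reduce_by_key(partitions, lambda a, b: a + b)

print(sums_from_groups == sums)            # both approaches give the same sums
print(grouped_shuffled, reduced_shuffled)  # records shuffled by each approach
```

Both approaches produce the same per-key sums, but the simulated groupByKey moves all 7 records while the simulated reduceByKey moves only 4, one per key per partition. In PySpark the corresponding calls would be `rdd.groupByKey().mapValues(sum)` and `rdd.reduceByKey(lambda a, b: a + b)`.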