GroupByKey vs ReduceByKey
What is the difference between groupByKey and reduceByKey in Spark?
GroupByKey

ReduceByKey / CombineByKey / AggregateByKey:

| GroupByKey | ReduceByKey / CombineByKey / AggregateByKey |
| --- | --- |
| All data is sent from the map task to the reduce task | A combiner runs on the map task before the shuffle, and the partial results are merged on the reduce task |
| No optimization of network I/O | Optimized network I/O |
| Should be used only if all values of a given key are needed in the reduce task (for example, an exact median) | Should be preferred over groupByKey whenever the result can be built by combining values incrementally, such as sum, count, average (from a sum and a count), or top N |
| Can lead to GC problems and job failure | Less data is shuffled, so the chance of job failure is lower |
| A single block (partition) is limited to 2 GB in older Spark versions | A single block (partition) is limited to 2 GB in older Spark versions |
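
The shuffle difference in the table can be illustrated with a small plain-Python simulation. This is only a sketch of the idea, not Spark's actual implementation: the `partitions` data and the helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical, and "records shuffled" is simply a count of the pairs that would leave the map side.

```python
from collections import defaultdict

# Hypothetical input: two map-side partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("b", 8)],
]


def shuffle_group_by_key(parts):
    """groupByKey-style: every record crosses the shuffle unchanged."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def shuffle_reduce_by_key(parts, func):
    """reduceByKey-style: combine inside each partition, then merge."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key leaves each partition.
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


grouped, grouped_records = shuffle_group_by_key(partitions)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
reduced, reduced_records = shuffle_reduce_by_key(partitions, lambda x, y: x + y)

print(sums_from_groups, "records shuffled:", grouped_records)
print(reduced, "records shuffled:", reduced_records)
```

Both approaches produce the same per-key sums, but the simulated combiner sends 4 records across the shuffle instead of 7. In PySpark the corresponding calls on a pair RDD are `rdd.groupByKey()` and `rdd.reduceByKey(lambda x, y: x + y)`.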