GroupByKey vs ReduceByKey
What's the difference between groupByKey and reduceByKey in Spark?
GroupByKey | ReduceByKey / CombineByKey / AggregateByKey |
All data is sent from the map task to the reduce task | A combiner runs on the map task before the shuffle and again on the reduce task |
No optimization of network I/O | Optimized network I/O |
Should be used only when all values of a given key are needed in the reduce task | Should be preferred whenever possible; avoid groupByKey. Use it when functions such as sum, average, median, mode, or top N are needed |
Can lead to GC problems and job failure | Less data is shuffled, so the chance of job failure is lower |
One Spark partition can hold a maximum of 2 GB of data | One Spark partition can hold a maximum of 2 GB of data |
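
The shuffle difference in the table can be illustrated with a small sketch. This is a plain-Python simulation, not Spark code: the helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` and the sample partitions are hypothetical, and the record counts only approximate what Spark would send over the network. In PySpark the corresponding calls would be `rdd.groupByKey()` and `rdd.reduceByKey(func)`.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Simulate groupByKey: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def shuffle_reduce_by_key(partitions, func):
    """Simulate reduceByKey: combine inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        combined = {}  # map-side combiner: one partial result per key
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        shuffled.extend(combined.items())
    reduced = {}  # reduce side: merge the partial results
    for key, value in shuffled:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return reduced, len(shuffled)


# Two hypothetical map-side partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

grouped, grouped_records = shuffle_group_by_key(partitions)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
sums, reduced_records = shuffle_reduce_by_key(partitions, lambda x, y: x + y)

print(grouped_records)  # 8 records shuffled without a combiner
print(reduced_records)  # 5 records shuffled with a map-side combiner
print(sums == sums_from_groups)  # True: same per-key totals either way
```

Both paths produce the same per-key sums, but the combiner path moves fewer records, and the gap widens as keys repeat more often within a partition.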