GroupBy vs ReduceByKey | Byte Padding

Posted on Mar 04, 2017 in Spark

What's the difference between groupByKey and reduceByKey in Spark?

groupByKey	reduceByKey / combineByKey / aggregateByKey
All data is sent from the map task to the reduce task	A combiner runs on the map task, and the partial results are merged on the reduce task
No optimization of network I/O	Optimized network I/O, because only partial results are shuffled
Should be used only if all values of a given key are needed in the reduce task	Should be preferred over groupByKey whenever the aggregation can be built from an associative merge, for example sum, count, average (as a sum and a count), min, max, or top N. An exact median or mode needs either all values of a key or a restructured computation
Can lead to GC problems and job failure when a key has many values	Less data is shuffled, so job failure is less likely
One Spark partition can hold at most 2 GB of data	One Spark partition can hold at most 2 GB of data
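
The difference in shuffle volume can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Spark code: `PARTITIONS`, `group_by_key`, and `reduce_by_key` are hypothetical names that only mimic what the two styles of operation do, under the assumption that every record leaving a partition counts as one shuffled record.

```python
from collections import defaultdict

# Hypothetical stand-in for two map-side partitions of (key, value) records.
PARTITIONS = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("c", 7), ("c", 8)],
]


def group_by_key(partitions):
    """Ship every record across the simulated shuffle, then collect values per key."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key(partitions, func):
    """Combine inside each partition first, then merge the partial results."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key leaves each partition.
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


grouped, grouped_shuffled = group_by_key(PARTITIONS)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
sums, reduced_shuffled = reduce_by_key(PARTITIONS, lambda x, y: x + y)

# An average is not associative by itself, so carry (sum, count) pairs instead.
pairs = [[(key, (value, 1)) for key, value in part] for part in PARTITIONS]
totals, _ = reduce_by_key(pairs, lambda x, y: (x[0] + y[0], x[1] + y[1]))
averages = {key: total / count for key, (total, count) in totals.items()}

print(grouped_shuffled, reduced_shuffled)  # 8 5
print(sums)                                # {'a': 14, 'b': 7, 'c': 15}
print(averages)                            # {'a': 3.5, 'b': 3.5, 'c': 7.5}
```

Both approaches produce the same sums, but the grouping version moves all 8 records while the combining version moves only 5, one per key per partition. The gap widens as the number of values per key grows, which is the reason the combining operations are the better choice for aggregations.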

Tags: combineByKey, groupByKey, reduceByKey, spark