Apache Spark: RDD Actions
Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. In other words, a RDD operation that returns a value of any type but RDD[T] is an action.
1) Reduce
Spark RDD reduce function reduces the elements of this RDD using the specified commutative and associative binary operator.
Spark reduce operation is almost similar as reduce method in Scala. It is an action operation of RDD which means it will trigger all the lined up transformation on the base RDD (or in the DAG) which are not executed and than execute the action operation on the last RDD. This operation is also a wide operation. In the sense the execution of this operation results in distributing the data across the multiple partitions.
It accepts a function with (which accepts two arguments and returns a single element) which should be Commutative and Associative in mathematical nature. That intuitively means, this function produces same result when repetitively applied on same set of RDD data with multiple partitions irrespective of element’s order.
Important points to note are,
• reduce is an action operation in Spark hence it triggers execution of DAG and gets execute on final RDD
• It is a wide operation as it is shuffling data from multiple partitions and reduces to a single value
• It accepts a Commutative and Associative function as an argument
• The parameter function should have two arguments of the same data type
• The return type of the function also must be same as argument types
2) ReduceByKey
Spark RDD reduceByKey function merges the values for each key using an associative reduce function.
Basically reduceByKey function works only for RDDs which contains key and value pairs kind of elements(i.e RDDs having tuple or Map as a data element). It is a transformation operation which means it is lazily evaluated. We need to pass one associative function as a parameter, which will be applied to the source RDD and will create a new RDD as with resulting values(i.e. key value pair). This operation is a wide operation as data shuffling may happen across the partitions.