Spark Pair RDD Functions Explained


Spark's PairRDDFunctions class provides several functions for working with pair RDDs, i.e., RDDs of key-value pairs.

In this post, we will learn these functions with Scala examples.

All these functions are grouped into transformations and actions, similar to regular RDDs.

Spark Pair RDD Transformation Functions

aggregateByKey – Aggregates the values of each key in the dataset. This function can return a different result type than the value type of the input RDD.
combineByKey – Combines the elements for each key using custom aggregation functions.
combineByKeyWithClassTag – Same as combineByKey, but takes a ClassTag for the combined result type.
flatMapValues – Flattens the values of each key without changing the keys, and preserves the original RDD's partitioning.
foldByKey – Merges the values of each key using an associative function and a neutral "zero value".
groupByKey – Groups the values for each key into a single sequence.
mapValues – Applies a map function to each value in the pair RDD without changing the keys.
reduceByKey – Returns a merged RDD where the values of each key are combined with an associative and commutative reduce function.
reduceByKeyLocally – Merges the values of each key and returns the final result to the driver as a Map.
sampleByKey – Returns a subset of the RDD, sampled by key.
subtractByKey – Returns an RDD with the pairs from this RDD whose keys are not in the other RDD.
keys – Returns all keys of this RDD as an RDD[K].
values – Returns all values of this RDD as an RDD[V].
partitionBy – Returns a new RDD partitioned using the specified partitioner.
fullOuterJoin – Returns an RDD after performing a full outer join between this RDD and the parameter RDD.
join – Returns an RDD after performing an inner join between this RDD and the parameter RDD.
leftOuterJoin – Returns an RDD after performing a left outer join between this RDD and the parameter RDD.
rightOuterJoin – Returns an RDD after performing a right outer join between this RDD and the parameter RDD.

Spark Pair RDD Actions

collectAsMap – Returns the key-value pairs of this RDD to the driver as a Map.
countByKey – Counts the number of elements for each key and returns the result to the driver as a local Map.
countByKeyApprox – Same as countByKey, but returns a partial result if the computation does not finish within the given timeout, which is passed as a parameter.
lookup – Returns the list of values in the RDD for the given key.
reduceByKeyLocally – Merges the values of each key and returns the final result to the driver as a Map.
saveAsHadoopDataset – Saves the RDD to any Hadoop-supported storage system (HDFS, S3, Elasticsearch, etc.) using a Hadoop JobConf object to configure the output.
saveAsHadoopFile – Saves the RDD to any Hadoop-supported file system (HDFS, S3, etc.) using the old Hadoop API's OutputFormat class.
saveAsNewAPIHadoopDataset – Saves the RDD to any Hadoop-supported storage system with the new Hadoop API, using a Hadoop Configuration object to configure the output.
saveAsNewAPIHadoopFile – Saves the RDD to any Hadoop-supported file system using the new Hadoop API's OutputFormat class.

Pair RDD Functions Examples

I will explain the Spark pair RDD functions with Scala examples. Before we get started, let's create a pair RDD.

This snippet creates a pair RDD by splitting every element in an RDD on spaces, flattening the result so that each element holds a single word, and finally assigning the integer 1 to every word.
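A minimal sketch of that setup; the SparkSession creation and the sample sentences are assumptions here, with the sentences chosen so the word counts match the output shown later in this post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PairRDDExamples")
  .master("local[1]")
  .getOrCreate()

// Assumed sample data; picked so the counts match the output below.
val rdd = spark.sparkContext.parallelize(
  List("Germany India USA", "USA India Russia", "India Brazil Canada China")
)

// Split each line on spaces and flatten so every element is one word.
val wordsRdd = rdd.flatMap(_.split(" "))

// Pair every word with the integer 1, giving an RDD[(String, Int)].
val pairRDD = wordsRdd.map(w => (w, 1))
```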

distinct – Returns the distinct key-value pairs of the RDD (here, since every value is 1, effectively the distinct keys).
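A short sketch using the pairRDD defined above; the printed order is not guaranteed.

```scala
// distinct removes duplicate (key, value) pairs; since every value
// is 1, this leaves one (word, 1) pair per distinct word.
pairRDD.distinct().foreach(println)
// Expected output (order may vary):
// (Brazil,1)
// (Canada,1)
// (China,1)
// (Germany,1)
// (India,1)
// (Russia,1)
// (USA,1)
```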

sortByKey – Transformation that returns an RDD sorted by key.
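A sketch on the same pairRDD; with a single local partition, the pairs print in key order.

```scala
// Sort the pairs by key in ascending order.
pairRDD.sortByKey().foreach(println)
// Expected output with the assumed sample data:
// (Brazil,1)
// (Canada,1)
// (China,1)
// (Germany,1)
// (India,1)
// (India,1)
// (India,1)
// (Russia,1)
// (USA,1)
// (USA,1)
```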


reduceByKey – Transformation that returns an RDD after merging the values for each key with the supplied function.

The resulting RDD contains unique keys.

The example below reduces by key, summing the values; with the sample data it yields the same seven pairs shown under aggregateByKey.
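A sketch of the word count, continuing from the pairRDD above.

```scala
// Sum the values for each key; duplicate keys collapse into one pair.
val wordCount = pairRDD.reduceByKey((a, b) => a + b)
wordCount.foreach(println)
```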

aggregateByKey – Transformation similar to reduceByKey, but it takes an explicit zero value and can return a result type different from the value type of the input RDD.

In our example, it produces the same result as reduceByKey, but through separate within-partition and cross-partition merge functions.
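A sketch using a zero value of 0 and addition for both merge steps; with the assumed sample data it yields the output below.

```scala
// The first function merges values within a partition; the second
// merges the per-partition results across partitions.
val wordCount2 = pairRDD.aggregateByKey(0)(_ + _, _ + _)
wordCount2.foreach(println)
```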


(Brazil,1)
(Canada,1)
(China,1)
(USA,2)
(Germany,1)
(Russia,1)
(India,3)

keys – Returns an RDD[K] with all the keys of the dataset.
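A sketch on the aggregated RDD from the previous example.

```scala
// Extract only the keys; here, the distinct words after aggregation.
wordCount2.keys.foreach(println)
// Expected output (order may vary):
// Brazil, Canada, China, USA, Germany, Russia, India
```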

values – Returns an RDD[V] with all the values of the dataset.
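The counterpart sketch to keys, on the same aggregated RDD.

```scala
// Extract only the values; here, the per-word counts.
wordCount2.values.foreach(println)
```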

count – Action that returns the number of elements in the dataset.
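A sketch on the original pairRDD; with the assumed sample data, there are 10 pairs.

```scala
// Number of (key, value) pairs in the pair RDD.
println("Count : " + pairRDD.count())
```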

collectAsMap – Action that returns the contents of the pair RDD to the driver as a Map. Because a Map holds one entry per key, duplicate keys collapse into a single entry.
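A sketch; note that collectAsMap materializes everything on the driver, so it suits only small results.

```scala
// Collect all pairs to the driver as a scala.collection.Map.
pairRDD.collectAsMap().foreach(println)
// Expected output: one entry per distinct word, e.g.
// (Brazil,1)
// (Canada,1)
// ...
```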

Complete Example
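A consolidated sketch of the snippets above, under the same assumptions about the sample data.

```scala
import org.apache.spark.sql.SparkSession

object PairRDDExamples extends App {

  val spark = SparkSession.builder()
    .appName("PairRDDExamples")
    .master("local[1]")
    .getOrCreate()

  // Assumed sample data, chosen to match the counts shown above.
  val rdd = spark.sparkContext.parallelize(
    List("Germany India USA", "USA India Russia", "India Brazil Canada China")
  )

  val pairRDD = rdd.flatMap(_.split(" ")).map(w => (w, 1))

  pairRDD.distinct().foreach(println)           // one pair per distinct word
  pairRDD.sortByKey().foreach(println)          // pairs sorted by key
  pairRDD.reduceByKey(_ + _).foreach(println)   // word count

  val wordCount = pairRDD.aggregateByKey(0)(_ + _, _ + _)
  wordCount.keys.foreach(println)               // distinct words
  wordCount.values.foreach(println)             // per-word counts

  println("Count : " + pairRDD.count())         // total number of pairs
  pairRDD.collectAsMap().foreach(println)       // pairs as a driver-side Map

  spark.stop()
}
```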

References:

https://spark.apache.org/docs/latest/index.html
