Spark RDD Actions with examples


RDD actions are operations that return raw values to the driver. In other words, any RDD function whose return type is something other than RDD[T] is considered an action in Spark programming.

 In this tutorial, we will learn RDD actions with Scala examples.

As mentioned in RDD Transformations, all transformations are lazy, meaning they are not executed right away; calling an action triggers the execution of the pending transformations.

Spark RDD Actions

Select a link from the table below to jump to an example.

RDD ACTION METHOD – METHOD DEFINITION

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U – Aggregates the elements of each partition, and then the results for all the partitions.

collect(): Array[T] – Returns the complete dataset as an Array.

count(): Long – Returns the count of elements in the dataset.

countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble] – Returns an approximate count of elements in the dataset; the result may be incomplete if execution exceeds the timeout.

countApproxDistinct(relativeSD: Double = 0.05): Long – Returns an approximate number of distinct elements in the dataset.

countByValue(): Map[T, Long] – Returns a Map[T, Long] where each key is a unique value in the dataset and the value is the number of times it occurs.

countByValueApprox(timeout: Long, confidence: Double = 0.95)(implicit ord: Ordering[T] = null): PartialResult[Map[T, BoundedDouble]] – Same as countByValue() but returns an approximate result.

first(): T – Returns the first element in the dataset.

fold(zeroValue: T)(op: (T, T) ⇒ T): T – Aggregates the elements of each partition, and then the results for all the partitions.

foreach(f: (T) ⇒ Unit): Unit – Iterates over all elements in the dataset, applying the function f to each element.

foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit – Similar to foreach, but applies the function f to each partition.

min()(implicit ord: Ordering[T]): T – Returns the minimum value in the dataset.

max()(implicit ord: Ordering[T]): T – Returns the maximum value in the dataset.

reduce(f: (T, T) ⇒ T): T – Reduces the elements of the dataset using the specified binary operator.

saveAsObjectFile(path: String): Unit – Saves the RDD as serialized objects to the storage system.

saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit – Saves the RDD as a compressed text file.

saveAsTextFile(path: String): Unit – Saves the RDD as a text file.

take(num: Int): Array[T] – Returns the first num elements of the dataset.

takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] – Returns the first num (smallest) elements from the dataset; this is the opposite of the top() action.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T] – Returns a random subset of the dataset in an Array.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

toLocalIterator(): Iterator[T] – Returns the complete dataset as an Iterator.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

top(num: Int)(implicit ord: Ordering[T]): Array[T] – Returns the top num (largest) elements from the dataset.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

treeAggregate – Aggregates the elements of this RDD in a multi-level tree pattern.

treeReduce – Reduces the elements of this RDD in a multi-level tree pattern.

RDD Actions Example

Before we start explaining RDD actions with examples, first, let’s create an RDD.
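A minimal setup sketch (the RDD names `inputRDD` and `listRdd` and the local-master configuration are illustrative choices, not prescribed by Spark):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkByExamples")
  .master("local[3]")     // run locally with 3 threads
  .getOrCreate()

// A pair RDD of (String, Int) tuples
val inputRDD = spark.sparkContext.parallelize(
  List(("Z", 1), ("A", 20), ("B", 30), ("C", 40), ("B", 30), ("B", 60)))

// An RDD of integers, with duplicates
val listRdd = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5, 3, 2))
```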

Note that we created two RDDs in the above code snippet, and we use them as needed to demonstrate the RDD actions.

aggregate – action

aggregate() – Aggregates the elements of each partition, and then combines the results across all the partitions.
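A sketch of aggregate() on an integer RDD (assuming an RDD named `listRdd` holding List(1, 2, 3, 4, 5, 3, 2), as created above):

```scala
// seqOp merges one element into the per-partition accumulator;
// combOp merges accumulators coming from different partitions
def seqOp = (acc: Int, v: Int) => acc + v
def combOp = (acc1: Int, acc2: Int) => acc1 + acc2

println("aggregate: " + listRdd.aggregate(0)(seqOp, combOp))  // 20 (sum of 1,2,3,4,5,3,2)
```

Because both operators here are plain addition with identity 0, the result is the same regardless of how elements are split across partitions.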

treeAggregate – action

treeAggregate() – Aggregates the elements of this RDD in a multi-level tree pattern. The output of this function will be similar to the aggregate function.
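A sketch of treeAggregate() (same `listRdd` of integers as above; the seqOp/combOp lambdas are inlined here):

```scala
// Same seqOp/combOp shape as aggregate(), but the per-partition results
// are merged in a multi-level tree (default depth is 2) instead of all at the driver
println("treeAggregate: " + listRdd.treeAggregate(0)((acc, v) => acc + v, (a, b) => a + b))  // 20
```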

fold – action

fold() – Aggregate the elements of each partition, and then the results for all the partitions.
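A sketch of fold() on the same integer RDD (assuming `listRdd` as created above):

```scala
// zeroValue is applied once per partition and once for the final merge,
// so it should be the identity for the operation (0 for addition)
println("fold: " + listRdd.fold(0)((acc, v) => acc + v))  // 20
```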

reduce

reduce() – Reduces the elements of the dataset using the specified binary operator.
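A sketch of reduce() (assuming `listRdd` as created above; the operator must be commutative and associative):

```scala
println("reduce sum: " + listRdd.reduce(_ + _))            // 20
println("reduce min: " + listRdd.reduce((a, b) => a min b)) // 1
```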

treeReduce

treeReduce() – Reduces the elements of this RDD in a multi-level tree pattern.
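A sketch of treeReduce() (assuming `listRdd` as created above):

```scala
// Same result as reduce(), but intermediate merges happen in a tree pattern
println("treeReduce: " + listRdd.treeReduce(_ + _))  // 20
```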

collect

collect() – Return the complete dataset as an Array.
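A sketch of collect() (assuming `listRdd` as created above):

```scala
// Brings the entire RDD to the driver; use only for small datasets
val data: Array[Int] = listRdd.collect()
println(data.mkString(","))  // 1,2,3,4,5,3,2
```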

count, countApprox, countApproxDistinct

count() – Return the count of elements in the dataset.

countApprox() – Return an approximate count of elements in the dataset; the result may be incomplete if execution exceeds the timeout.

countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
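Sketches of the counting actions (assuming `listRdd` as created above; the 1200 ms timeout is an arbitrary illustrative value):

```scala
println("count: " + listRdd.count())  // 7
// Returns a PartialResult; it may be incomplete if the timeout is hit
println("countApprox: " + listRdd.countApprox(1200))
// HyperLogLog-based estimate; the exact distinct count here is 5
println("countApproxDistinct: " + listRdd.countApproxDistinct())
```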

countByValue, countByValueApprox

countByValue() – Return a Map[T, Long] where each key is a unique value in the dataset and the value is the number of times it occurs.

countByValueApprox() – Same as countByValue() but returns approximate result.
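A sketch of countByValue() (assuming `listRdd` as created above):

```scala
// Map ordering is unspecified; the counts are 1→1, 2→2, 3→2, 4→1, 5→1.
// Note: the whole Map is materialized on the driver.
println("countByValue: " + listRdd.countByValue())
```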

first

first() – Return the first element in the dataset.
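A sketch of first() (assuming the two RDDs created above):

```scala
println("first: " + listRdd.first())   // 1
println("first: " + inputRDD.first())  // (Z,1)
```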

top

top() – Return top n elements from the dataset.

Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.
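A sketch of top() (assuming `listRdd` as created above):

```scala
// top() uses the implicit descending ordering, so the largest elements come first
println("top(2): " + listRdd.top(2).mkString(","))  // 5,4
```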

min

min() – Return the minimum value from the dataset.
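A sketch of min() (assuming `listRdd` as created above):

```scala
println("min: " + listRdd.min())  // 1
```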

max

max() – Return the maximum value from the dataset.
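A sketch of max() (assuming `listRdd` as created above):

```scala
println("max: " + listRdd.max())  // 5
```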

take, takeOrdered, takeSample

take() – Return the first num elements of the dataset.

takeOrdered() – Return the first num (smallest) elements from the dataset; this is the opposite of the top() action.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

takeSample() – Return a random subset of the dataset in an Array.
Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.
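Sketches of the take* actions (assuming `listRdd` as created above):

```scala
println("take(2): " + listRdd.take(2).mkString(","))              // 1,2
println("takeOrdered(2): " + listRdd.takeOrdered(2).mkString(",")) // 1,2 (smallest first)
// Random sample without replacement; the result varies unless a seed is passed
println("takeSample: " + listRdd.takeSample(false, 2).mkString(","))
```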

Actions – Complete example
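Putting the actions above together, a runnable sketch (assuming Spark is on the classpath; object and variable names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object RDDActionsExample extends App {
  val spark = SparkSession.builder()
    .appName("SparkByExamples")
    .master("local[3]")
    .getOrCreate()

  val listRdd = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5, 3, 2))

  println("aggregate: " + listRdd.aggregate(0)((acc, v) => acc + v, _ + _))
  println("fold: "      + listRdd.fold(0)(_ + _))
  println("reduce: "    + listRdd.reduce(_ + _))
  println("collect: "   + listRdd.collect().mkString(","))
  println("count: "     + listRdd.count())
  println("first: "     + listRdd.first())
  println("top(2): "    + listRdd.top(2).mkString(","))
  println("min/max: "   + listRdd.min() + "/" + listRdd.max())
  println("take(2): "   + listRdd.take(2).mkString(","))

  spark.stop()
}
```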

References:

https://spark.apache.org/docs/latest/quick-start.html
