Spark Accumulators


Spark Accumulators are shared variables that are only "added to" through an associative and commutative operation. They are used to implement counters (similar to MapReduce counters) or sum operations.

Out of the box, Spark supports accumulators of numeric types, and it also provides the ability to add custom accumulator types.

Programmers can create the following kinds of accumulators:

  • named accumulators
  • unnamed accumulators

When you create a named accumulator, it appears on the Spark web UI under the "Accumulator" tab. This tab shows two tables: the first, "accumulable", lists all named accumulator variables and their values; the second, "Tasks", shows the value of each accumulator as modified by each task.

Unnamed accumulators, on the other hand, are not shown on the Spark web UI, so for all practical purposes it is advisable to use named accumulators.

Creating Accumulator variable

By default, Spark provides accumulator methods for long, double, and collection types. These methods are defined on the SparkContext class and return LongAccumulator, DoubleAccumulator, and CollectionAccumulator respectively.

  • Long Accumulator
  • Double Accumulator
  • Collection Accumulator

For example, you can create a long accumulator on spark-shell as follows.
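A minimal spark-shell sketch (the shell provides `sc` as the SparkContext; the variable name `accum` is our own choice):

```scala
// spark-shell already exposes the SparkContext as `sc`
val accum = sc.longAccumulator("SumAccumulator")
```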

The above statement creates a named accumulator "SumAccumulator". Now, let's see how to add the elements of an array to this accumulator.
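A sketch of summing array elements into the accumulator (the accumulator is recreated here so the snippet is self-contained; the sample values are arbitrary):

```scala
val accum = sc.longAccumulator("SumAccumulator")
// foreach is an action, so each task calls add() on its partition's elements
sc.parallelize(Array(1, 2, 3)).foreach(x => accum.add(x))
// Only the driver can read the accumulated value
println(accum.value) // prints 6
```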

Each of these accumulator classes has several methods; among them, the add() method is called from tasks running on the cluster. Tasks cannot read the accumulator's value; only the driver program can read it, using the value method.

Long Accumulator

The longAccumulator() method on SparkContext returns a LongAccumulator.

Syntax

You can create a named accumulator for the long type using SparkContext.longAccumulator(name).

For an unnamed accumulator, use the overload that takes no argument: SparkContext.longAccumulator().
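The two overloads on SparkContext look like this (signatures as in the Spark 2.x+ Scala API):

```scala
// Unnamed long accumulator
def longAccumulator: LongAccumulator

// Named long accumulator (the name shows up on the Spark web UI)
def longAccumulator(name: String): LongAccumulator
```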

A complete accumulator program is shown below.
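A self-contained sketch of a LongAccumulator program (the object name, app name, and sample data are our own choices; `local[*]` is assumed for local testing):

```scala
import org.apache.spark.sql.SparkSession

object LongAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AccumulatorExample")
      .master("local[*]")
      .getOrCreate()

    // Create a named long accumulator on the SparkContext
    val longAcc = spark.sparkContext.longAccumulator("SumAccumulator")

    // Tasks call add(); the driver reads the result via value
    val rdd = spark.sparkContext.parallelize(Array(1, 2, 3))
    rdd.foreach(x => longAcc.add(x))

    println(longAcc.value) // prints 6

    spark.stop()
  }
}
```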

The LongAccumulator class provides the following methods:

  • isZero
  • copy
  • reset
  • add
  • count
  • sum
  • avg
  • merge
  • value
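A quick spark-shell sketch of the most commonly used of these methods (sample values are arbitrary):

```scala
val acc = sc.longAccumulator("demo")
sc.parallelize(Seq(2L, 4L, 6L)).foreach(acc.add(_))

println(acc.sum)   // total of all added values: 12
println(acc.count) // number of values added: 3
println(acc.avg)   // sum / count: 4.0
println(acc.value) // same as sum for LongAccumulator: 12

acc.reset()
println(acc.isZero) // true after reset
```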

Double Accumulator

Create a named accumulator for the double type using SparkContext.doubleAccumulator(name); for an unnamed one, use the overload that takes no argument.

Syntax
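The overloads mirror those for the long type (signatures as in the Spark Scala API):

```scala
// Unnamed double accumulator
def doubleAccumulator: DoubleAccumulator

// Named double accumulator
def doubleAccumulator(name: String): DoubleAccumulator
```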

The DoubleAccumulator class provides methods similar to those of LongAccumulator.
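A brief spark-shell sketch using a double accumulator (sample values are arbitrary):

```scala
val doubleAcc = sc.doubleAccumulator("DoubleAccumulator")
sc.parallelize(Array(1.5, 2.5, 3.0)).foreach(doubleAcc.add(_))
println(doubleAcc.value) // prints 7.0
```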

Collection Accumulator

Create a named accumulator for a collection type using SparkContext.collectionAccumulator(name); for an unnamed one, use the overload that takes no argument.

Syntax
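The overloads are generic in the element type (signatures as in the Spark Scala API):

```scala
// Unnamed collection accumulator of elements of type T
def collectionAccumulator[T]: CollectionAccumulator[T]

// Named collection accumulator of elements of type T
def collectionAccumulator[T](name: String): CollectionAccumulator[T]
```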

The CollectionAccumulator class provides the following methods:

  • isZero
  • copyAndReset
  • copy
  • reset
  • add
  • merge
  • value

Here is an example of how to use a collection accumulator.
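A spark-shell sketch (note that value returns a java.util.List, and element order is not guaranteed because tasks add concurrently):

```scala
val collAcc = sc.collectionAccumulator[Int]("CollectionAccumulator")
sc.parallelize(Array(1, 2, 3)).foreach(collAcc.add(_))
// Returns a java.util.List[Int], e.g. [1, 2, 3] (order may vary)
println(collAcc.value)
```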


