Spark Broadcast Variables


In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and made available on all nodes in a cluster so that tasks can access them. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs.

Use case

Let me explain with an example. Assume you receive two-letter state codes in a file and want to transform them into full state names (for example, CA to California, NY to New York, etc.) by looking them up in a reference mapping. In some instances, this data could be large, and you may have many such lookups (like zip codes).

Instead of distributing this information along with each task over the network (which adds overhead and is time-consuming), we can use a broadcast variable to cache the lookup data on each machine. Tasks then read this cached data while executing the transformations.

How does Spark Broadcast work?

Broadcast variables are used in the same way for RDD, DataFrame, and Dataset.

When you run a Spark RDD or DataFrame job that defines and uses broadcast variables, Spark does the following.

  • Spark breaks the job into stages, which are separated by distributed shuffle operations; actions are executed within these stages.
  • Each stage is then broken into tasks.
  • Spark broadcasts the common (reusable) data needed by tasks within each stage.
  • The broadcast data is cached in serialized form and deserialized before each task runs.

You should create and use broadcast variables for data that is shared across multiple stages and tasks.

Note that broadcast variables are not sent to executors by the sc.broadcast(variable) call itself; instead, they are sent to executors when they are first used.

How to create a Broadcast variable

A Spark broadcast variable is created using the broadcast(v) method of the SparkContext class. This method takes the value v that you want to broadcast and returns a wrapper whose contents are read through its value member.

In Spark shell
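
A minimal sketch of such a session is shown here. The local master, the application name, and the sample array are illustrative assumptions; in spark-shell the spark and sc objects already exist, so getOrCreate() simply reuses them.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration. In spark-shell,
// getOrCreate() returns the session that the shell already started.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("BroadcastShellSketch")
  .getOrCreate()
val sc = spark.sparkContext

// Wrap a small array in a broadcast variable.
val broadcastVar = sc.broadcast(Array(0, 1, 2, 3))

// Read the wrapped data back through .value.
val values: Array[Int] = broadcastVar.value
println(values.mkString(", "))  // prints "0, 1, 2, 3"
```

The same broadcastVar reference can be used inside any transformation that runs on the executors.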

Spark RDD Broadcast variable example

Below is a very simple example of how to use broadcast variables on an RDD. This example defines commonly used data (countries and states) in Map variables, distributes them using SparkContext.broadcast(), and then uses them in an RDD map() transformation.
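
One way to write this is sketched here. The sample records, the lookup maps, the variable names, and the local master are all illustrative assumptions rather than a fixed recipe.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("BroadcastRddSketch")
  .getOrCreate()

// Lookup tables to share with every executor (illustrative data).
val states = Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida")
val countries = Map("USA" -> "United States of America", "IN" -> "India")

val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)

// Records carry only the short codes.
val data = Seq(
  ("James", "Smith", "USA", "CA"),
  ("Michael", "Rose", "USA", "NY"),
  ("Robert", "Williams", "USA", "CA"),
  ("Maria", "Jones", "USA", "FL")
)
val rdd = spark.sparkContext.parallelize(data)

// Inside map(), read the lookup tables through .value.
// If a code has no match, keep the code itself.
val rdd2 = rdd.map { case (first, last, country, state) =>
  val fullCountry = broadcastCountries.value.getOrElse(country, country)
  val fullState = broadcastStates.value.getOrElse(state, state)
  (first, last, fullCountry, fullState)
}

rdd2.collect().foreach(println)
```

Using getOrElse instead of a direct lookup keeps a task from failing on an unknown code; whether that fallback is appropriate depends on your data.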

Spark DataFrame Broadcast variable example

Below is an example of how to use broadcast variables on a DataFrame. Similar to the RDD example above, it defines commonly used data (countries and states) in Map variables, distributes them using SparkContext.broadcast(), and then uses them in a DataFrame map() transformation.
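
A sketch of the DataFrame variant follows, under the same assumptions: the column names, sample rows, lookup maps, and local master are illustrative. The map() call needs an encoder for the result type, which import spark.implicits._ provides for tuples.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("BroadcastDataFrameSketch")
  .getOrCreate()
import spark.implicits._

// Lookup tables to share with every executor (illustrative data).
val states = Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida")
val countries = Map("USA" -> "United States of America", "IN" -> "India")

val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)

val columns = Seq("firstname", "lastname", "country", "state")
val df = Seq(
  ("James", "Smith", "USA", "CA"),
  ("Michael", "Rose", "USA", "NY"),
  ("Robert", "Williams", "USA", "CA"),
  ("Maria", "Jones", "USA", "FL")
).toDF(columns: _*)

// Replace the codes in each row using the broadcast lookup tables.
// If a code has no match, keep the code itself.
val df2 = df.map { row =>
  val country = row.getString(2)
  val state = row.getString(3)
  (
    row.getString(0),
    row.getString(1),
    broadcastCountries.value.getOrElse(country, country),
    broadcastStates.value.getOrElse(state, state)
  )
}.toDF(columns: _*)

df2.show(false)
```

The second toDF(columns: _*) call restores the column names, since mapping to a tuple otherwise yields the default names _1 through _4.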
