What is SparkContext | Explained with Code


Spark's default programming language is Scala.

SparkContext (JavaSparkContext for Java) is the entry point to Spark and PySpark for programming with RDDs and for connecting to a Spark cluster.

In this article, you will learn how to create it using examples.

What is SparkContext

SparkContext is the entry point to Spark. It has been defined in the org.apache.spark package since Spark 1.x and is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Its object sc is available by default in spark-shell, and it can also be created programmatically using the SparkContext class.

SparkContext has been available since Spark 1.x and is the entry point to Spark when you want to program with Spark RDDs. Most of the operations, methods, and functions we use in Spark come from SparkContext, for example accumulators, broadcast variables, parallelize, and more.

Note that you can create only one SparkContext per JVM.

At any given time, only one SparkContext instance should be active per JVM. If you want to create another SparkContext, you should stop the existing one (using stop()) before creating a new one.
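For example, a minimal sketch of this rule (assuming sc is the currently active SparkContext, and using a placeholder app name and master) looks like this:


  import org.apache.spark.{SparkConf, SparkContext}

  sc.stop()                                                  // stop the active context first
  val newConf = new SparkConf().setAppName("new-app").setMaster("local[1]")
  val newContext = new SparkContext(newConf)                 // now a new context can be created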

SparkContext in spark-shell

How to open Spark-Shell

Step 1: Go to the Spark installation folder, for example /usr/local/spark, where you will find the bin directory. This directory contains the scripts used to communicate with Spark.

Type: ./bin/spark-shell

The following information will be displayed.

(Screenshot: Spark bin – spark-shell startup output)

If you have set the Spark home in .bashrc (and added its bin directory to your PATH), you can type spark-shell directly from anywhere in the terminal.

By default, spark-shell provides the sc object, which is an instance of the SparkContext class. We can use this object directly wherever we need to perform different operations.


  val rdd = sc.textFile("/src/main/resources/text/alice.txt")
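
As a further illustration (the numbers here are just sample data), the same sc object can create an RDD from an in-memory collection and run simple actions on it:


  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))   // create an RDD from a local collection
  println(numbers.count())                            // prints 5
  println(numbers.sum())                              // prints 15.0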

Creating SparkContext using Scala program in 1.X

When programming with Scala, PySpark, or Java, you first create a SparkConf instance by setting the application name and master using the SparkConf methods setAppName() and setMaster() respectively, and then pass the SparkConf object as an argument to the SparkContext constructor to create a SparkContext.


  import org.apache.spark.{SparkConf, SparkContext}

  val sparkConf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(sparkConf)

Since Spark 2.0, creating a SparkContext directly with its constructor is discouraged; the recommendation is to use the static method getOrCreate() instead. This method gets an existing SparkContext or instantiates a new one and registers it as a singleton object.


SparkContext.getOrCreate(sparkConf)
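
As a small illustration of the singleton behavior (reusing the sparkConf defined above), a second call to getOrCreate() returns the already-registered instance rather than creating a new one:


  val sparkContext = SparkContext.getOrCreate(sparkConf)   // gets or creates the singleton context
  val sameContext = SparkContext.getOrCreate()              // returns the same registered instance
  println(sparkContext eq sameContext)                      // prints: true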

Once you have created a SparkContext object, use it to create Spark RDDs.


  val rdd = sparkContext.textFile("/src/main/resources/text/alice.txt")

Creating SparkContext using Scala program since 2.x

Since Spark 2.0, we mostly use SparkSession, and most of the methods available on SparkContext are also present on SparkSession. SparkSession internally creates the SparkContext and exposes it through the sparkContext variable.


  val sparkContext = spark.sparkContext
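
As a minimal sketch (the app name and master here are only examples), a SparkSession is created with the builder API and its underlying SparkContext is then available as spark.sparkContext:


  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("sparkbyexamples.com")
    .master("local[1]")
    .getOrCreate()

  val sparkContext = spark.sparkContext    // SparkContext created internally by the session
  println(sparkContext.appName)            // sparkbyexamples.com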

SparkContext commonly used methods

accumulator – Creates an accumulator variable of a given data type. Tasks running on the cluster can add to it, but only the driver program can read its value.

applicationId – Returns the unique ID of the Spark application.

appName – Returns the app name that was given when creating the SparkContext.

broadcast – Creates a read-only variable that is broadcast to the entire cluster. The value is shipped to each executor only once and cached there.

emptyRDD – Creates an empty RDD.

getPersistentRDDs – Returns all persisted RDDs.

getOrCreate – Creates a new SparkContext or returns the existing one.

hadoopFile – Returns an RDD of a Hadoop file.

master – Returns the master URL that was set while creating the SparkContext.

newAPIHadoopFile – Creates an RDD for a Hadoop file with the new API InputFormat.

sequenceFile – Gets an RDD for a Hadoop SequenceFile with the given key and value types.

setLogLevel – Changes the log level (for example DEBUG, INFO, WARN, ERROR, or FATAL).

textFile – Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD.

union – Unions two or more RDDs.

wholeTextFiles – Reads text files from a folder on HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of that file.
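
As a quick, illustrative sketch (the app name and sample data below are made up for this example), here is how a few of these methods look when used together from a single SparkContext:


  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("sparkcontext-methods").setMaster("local[1]")
  val sc = SparkContext.getOrCreate(conf)          // getOrCreate - create or reuse the context
  sc.setLogLevel("ERROR")                          // setLogLevel - reduce logging noise

  println(sc.appName)                              // app name set above
  println(sc.applicationId)                        // unique id of this application
  println(sc.master)                               // local[1]

  val words = sc.parallelize(Seq("spark", "context"))
  val more = sc.parallelize(Seq("rdd", "example"))
  println(sc.union(Seq(words, more)).count())      // union - combines the two RDDs, prints 4

  val lookup = sc.broadcast(Map("spark" -> 1))     // broadcast - read-only value cached on executors
  println(lookup.value("spark"))                   // prints 1

  println(sc.emptyRDD[String].isEmpty())           // emptyRDD - prints true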

SparkContext Example


package com.sparkbyexamples.spark.stackoverflow

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextOld extends App {

  // Create the first SparkContext and use it to read a text file into an RDD
  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)
  val rdd = sparkContext.textFile("/src/main/resources/text/alice.txt")

  sparkContext.setLogLevel("ERROR")

  println("First SparkContext:")
  println("APP Name :" + sparkContext.appName)
  println("Deploy Mode :" + sparkContext.deployMode)
  println("Master :" + sparkContext.master)

  // Only one SparkContext can be active per JVM, so stop the first one
  // before creating a second context.
  sparkContext.stop()

  val conf2 = new SparkConf().setAppName("sparkbyexamples.com-2").setMaster("local[1]")
  val sparkContext2 = new SparkContext(conf2)

  println("Second SparkContext:")
  println("APP Name :" + sparkContext2.appName)
  println("Deploy Mode :" + sparkContext2.deployMode)
  println("Master :" + sparkContext2.master)

  sparkContext2.stop()
}

Conclusion

In this SparkContext article, you have learned what SparkContext is, how to create it in Spark 1.x and Spark 2.0, and how to use it with a few basic examples.

Reference

https://spark.apache.org/docs/latest/index.html
