In this post, we are going to explain SparkSession.
Since Spark 2.0, SparkSession has been the entry point to Spark programming with RDD, DataFrame, and Dataset. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, how to create one, and how to use the default SparkSession variable ‘spark’ provided by spark-shell.
What is SparkSession
SparkSession was introduced in version 2.0. It is an entry point to the underlying Spark functionality, used to programmatically create Spark RDDs, DataFrames, and Datasets. Its object spark is available by default in spark-shell, and it can also be created programmatically using the SparkSession builder pattern.
With Spark 2.0, a new class org.apache.spark.sql.SparkSession was introduced. It combines the different contexts we used to have prior to 2.0 (SQLContext, HiveContext, etc.), so SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.
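As a minimal sketch of what using SparkSession in place of SQLContext can look like, the example below runs a SQL query directly through the session. The object name, the view name "numbers", and the local master setting are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object SessionInsteadOfSqlContext {
  // Counts the even numbers below `n` with a SQL query run through SparkSession
  def countEvens(spark: SparkSession, n: Long): Long = {
    // spark.range creates a Dataset with a single column named "id"
    spark.range(n).createOrReplaceTempView("numbers") // hypothetical view name
    // Before 2.0 a query like this would have gone through sqlContext.sql(...)
    spark.sql("SELECT id FROM numbers WHERE id % 2 = 0").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")                    // assumed: run locally on one core
      .appName("SessionInsteadOfSqlContext") // assumed app name
      .getOrCreate()
    println(countEvens(spark, 10))
    spark.stop()
  }
}
```

The same query could still be issued through spark.sqlContext.sql(...), which is kept for backward compatibility.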
As mentioned in the beginning, SparkSession is an entry point to Spark, and creating a SparkSession instance is the first statement you would write to program with RDD, DataFrame, and Dataset. A SparkSession is created using the SparkSession.builder() builder pattern.
Though SparkContext used to be the entry point prior to 2.0, it is not completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and a SparkContext with the configuration provided to SparkSession.
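To make this relationship concrete, here is a small sketch that builds a session and then reaches the SparkContext and SparkConf behind it. The object name, the helper build(), and the choice of the spark.ui.enabled setting are assumptions picked for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SessionInternals {
  // Hypothetical helper: builds a local session with one extra config entry
  def build(appName: String): SparkSession =
    SparkSession.builder()
      .master("local[1]")                  // assumed: local mode, one core
      .appName(appName)
      .config("spark.ui.enabled", "false") // assumed setting, passed on to SparkConf
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build("SessionInternals")
    val sc = spark.sparkContext // the SparkContext created by the session

    // The app name and config given to the builder are visible on the context
    println(sc.appName)
    println(sc.getConf.get("spark.ui.enabled"))

    // SparkContext features such as RDD creation remain available
    val rdd = sc.parallelize(Seq(1, 2, 3))
    println(rdd.sum())

    spark.stop()
  }
}
```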
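Spark Session also exposes the functionality that was previously spread across the different contexts:
- Spark Context (through spark.sparkContext),
- SQL Context,
- Streaming Context (as Structured Streaming through readStream(); the DStream-based StreamingContext is still created separately from a SparkContext),
- Hive Context (when Hive support is enabled on the builder).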
SparkSession in spark-shell
By default, the Spark shell provides a “spark” object, which is an instance of the SparkSession class. We can use this object directly wherever required in spark-shell.
scala> val sqlcontext = spark.sqlContext
Similar to the Spark shell, in most tools the environment itself creates a default SparkSession object for us to use, so you don’t have to worry about creating a SparkSession.
Creating SparkSession from Scala program
To create a SparkSession in Scala or Python, you need to use the builder pattern method builder(), set the master and app name, and finally call the getOrCreate() method. This method returns the already existing SparkSession if there is one; otherwise, it creates a new SparkSession.
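import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[1]").appName("SparkSessionExample").getOrCreate() // master and app name are placeholder values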
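To illustrate the getOrCreate() behavior described above, the following sketch asks the builder for a session twice and checks whether the same instance comes back. The object name, the helper session(), and the app names are hypothetical and chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession

object GetOrCreateExample {
  // Hypothetical helper that asks the builder for a local session
  def session(appName: String): SparkSession =
    SparkSession.builder()
      .master("local[1]") // assumed: local mode, one core
      .appName(appName)
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val first = session("GetOrCreateExample")
    // A session already exists, so getOrCreate() hands it back
    // instead of building a new one
    val second = session("AnotherName")

    // eq compares references, so this checks that both vals
    // point at the same SparkSession instance
    println(first eq second)

    first.stop()
  }
}
```

If you need a second session with its own temporary views and SQL settings on top of the same SparkContext, spark.newSession() is the usual way to get one.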
SparkSession commonly used methods
version – Returns the Spark version your application is running on, most likely the Spark version your cluster is configured with.
createDataFrame() – Creates a DataFrame from a collection or an RDD that we pass to the method.
createDataset() – Creates a Dataset from a collection or an RDD.
emptyDataFrame() – Creates an empty DataFrame.
emptyDataset() – Creates an empty Dataset.
getActiveSession() – Returns the active SparkSession for the current thread, if there is one.
implicits() – Gives access to the nested Scala object of implicit conversions, typically brought in with import spark.implicits._
read() – Returns an instance of the DataFrameReader class, which is used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame.
readStream() – Returns an instance of the DataStreamReader class, which is used to read streaming data into a DataFrame.
sparkContext() – Returns a SparkContext.
sql() – Returns a DataFrame after executing the given SQL statement.
sqlContext() – Returns SQLContext.
stop() – Stops the underlying SparkContext.
table() – Returns a DataFrame of a table or view.
udf() – Returns a UDFRegistration, which is used to register user-defined functions (UDFs) for use in SQL.
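Several of the methods listed above can be tried together in one small program. This is only a sketch: the object name, the sample data, the view name "people", and the UDF name "to_upper" are all made up for illustration. Note that in Scala, members such as version, udf, and emptyDataFrame are accessed without parentheses.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object CommonMethodsExample {
  // Uses createDataFrame(), udf and sql() to return upper-cased adult names
  def adultNames(spark: SparkSession): DataFrame = {
    val people = spark
      .createDataFrame(Seq(("Ann", 34), ("Bob", 12))) // sample data (made up)
      .toDF("name", "age")
    people.createOrReplaceTempView("people")           // hypothetical view name

    // Register a UDF so that it can be called from SQL
    spark.udf.register("to_upper", (s: String) => s.toUpperCase)
    spark.sql("SELECT to_upper(name) AS name FROM people WHERE age >= 18")
  }

  // Uses createDataset() with the encoders provided by spark.implicits._
  def names(spark: SparkSession): Dataset[String] = {
    import spark.implicits._
    spark.createDataset(Seq("Ann", "Bob"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")               // assumed: local mode, one core
      .appName("CommonMethodsExample")  // assumed app name
      .getOrCreate()

    println(spark.version)                 // Spark version in use
    adultNames(spark).show()               // result of the SQL query
    println(names(spark).count())          // number of elements in the Dataset
    println(spark.table("people").count()) // rows in the temporary view
    println(spark.emptyDataFrame.count())  // an empty DataFrame has no rows

    spark.stop()
  }
}
```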
In this SparkSession article, you have learned what SparkSession is, its usage, how to create a SparkSession programmatically, and some of the commonly used SparkSession methods.