
SparkContext v/s SparkSession - Deep Dive

  • kumarnitinkarn
  • Jan 28, 2020
  • 2 min read

Prior to Spark 2.0, SparkContext was the entry point of any Spark application and was used to access all Spark features. The Spark driver program uses the SparkContext to connect to the cluster through the resource manager.


A SparkConf is required to create the SparkContext object; it stores configuration parameters like appName (to identify your Spark driver), and the number of cores and memory size of the executors running on the worker nodes.
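For instance, those settings correspond to standard Spark properties that can be set directly on the SparkConf. A minimal sketch (the app name and the values here are just placeholders):

# app name plus executor cores and memory, all carried by the SparkConf
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("MyDriverApp")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "2g"))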


In order to use the SQL, Hive, and Streaming APIs, separate contexts (SQLContext, HiveContext, StreamingContext) need to be created on top of the SparkContext.


Creating a SparkContext:


If you are in spark-shell, a SparkContext is already available for you and is assigned to the variable sc. If you don’t have a SparkContext already, you can create one by first creating a SparkConf.


# set up the spark configuration and create the contexts
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkConf = SparkConf().setAppName("SparkSessionExample").setMaster("local")

# your handle to the SparkContext, used to create other contexts like SQLContext
sc = SparkContext(conf=sparkConf)

# creating the SQL context
sqlContext = SQLContext(sc)
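The other pre-2.0 contexts mentioned earlier were built the same way, each wrapping the existing SparkContext. A minimal sketch using the legacy classes (both are deprecated from 2.0 onwards and removed in later releases; the 10-second batch interval is just an example value):

from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext

# separate contexts for Hive and Streaming, both built on top of the same sc
hiveContext = HiveContext(sc)
ssc = StreamingContext(sc, batchDuration=10)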


SparkSession:


In Spark 2.0, the same effects can be achieved through SparkSession, without explicitly creating a SparkConf, SparkContext, or SQLContext, as they are all encapsulated within the SparkSession.


Creating a SparkSession:


# Create a SparkSession. No need to create a SparkContext;
# you automatically get it as part of the SparkSession.
from pyspark.sql import SparkSession

warehouseLocation = "file:${system:user.dir}/spark-warehouse"

spark = (SparkSession
         .builder
         .appName("SparkSessionZipsExample")
         .config("spark.sql.warehouse.dir", warehouseLocation)
         .enableHiveSupport()
         .getOrCreate())

Note: enableHiveSupport() -> enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

getOrCreate() -> gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder. This method first checks whether there is a valid global default SparkSession and, if so, returns it. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns it as the global default.
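To see getOrCreate() in action: calling it a second time simply returns the same global default session instead of building a new one. A minimal sketch, reusing the spark variable created above:

# a second getOrCreate() call returns the existing global default SparkSession
spark2 = SparkSession.builder.getOrCreate()
print(spark2 is spark)   # True: same session object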

At this point you can use the spark variable as your entry point to access its public methods and the underlying contexts for the duration of your Spark job.
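As a quick illustration, the session alone is enough for DataFrame and SQL work, with no separate SQLContext involved. A minimal sketch with made-up data:

# create a tiny DataFrame and query it through the same session
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.createOrReplaceTempView("letters")
spark.sql("SELECT id, letter FROM letters WHERE id > 1").show()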

Configuring Spark runtime options using the SparkSession object:

# set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")


How to access the underlying SparkContext using the SparkSession object


We can access the SparkContext and other contexts using the SparkSession object.

SparkSession.sparkContext returns the underlying SparkContext, which is used for creating RDDs as well as managing cluster resources.


scala> spark.sparkContext

res17: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2debe9ac
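For example, the context exposed by the session can be used directly to create an RDD. A minimal PySpark sketch with made-up data:

# create and transform an RDD through the SparkContext exposed by the SparkSession
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8]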



Why do I need a SparkSession when I already have a SparkContext?


Consider a scenario where multiple users access the same notebook environment that shares a single SparkContext, and the requirement is to give each user an isolated environment while still sharing that SparkContext.

Prior to 2.0, the solution was to create multiple SparkContexts, i.e. one SparkContext per isolated environment, but this is an expensive operation, and the crash of one SparkContext can affect the others.

With the introduction of SparkSession, this issue has been addressed.


From Spark 2.0 onwards, Spark provides a straightforward API to create a new session that shares the same SparkContext.

spark.newSession() -> creates a new SparkSession object.


scala> val session2 = spark.newSession()

session2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@691fffb9


scala> spark

res22: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@506bc254


scala> spark.sparkContext

res26: org.apache.spark.SparkContext = org.apache.spark.SparkContext@715fceaf


scala> session2.sparkContext

res27: org.apache.spark.SparkContext = org.apache.spark.SparkContext@715fceaf


Look at the hashes of spark and session2: they are different, which confirms that the two sessions are distinct objects.

But the hash of sparkContext is the same for both sessions, which means both SparkSessions share the same SparkContext.


This isolation applies to configurations as well: each session can have its own configs.
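A quick way to verify this is to change a SQL config in the new session and read it back from both. A minimal PySpark sketch (assuming spark.sql.shuffle.partitions was set to 6 on the original session, as above):

# each session keeps its own SQL configuration
session2 = spark.newSession()
session2.conf.set("spark.sql.shuffle.partitions", 12)

print(spark.conf.get("spark.sql.shuffle.partitions"))      # still '6' in the original session
print(session2.conf.get("spark.sql.shuffle.partitions"))   # '12' in the new session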


Keep learning, keep growing!

 
 
 
