
SPARK KNACK
Naturally Curious
SparkContext v/s SparkSession - Deep Dive
Prior to Spark 2.0, SparkContext was the entry point of any Spark application and was used to access all Spark features. The Spark driver...
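As a quick illustration, a minimal pre-2.0 entry point looked roughly like this (the app name and master URL below are placeholders, not values from the article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pre-2.0 style: configure and create a SparkContext directly
val conf = new SparkConf()
  .setAppName("my-app")   // placeholder application name
  .setMaster("local[*]")  // run locally on all available cores
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 5)
println(rdd.count())  // 5
sc.stop()
```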
Spark Performance Tuning II : persist() and cache()
When we persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset....
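A minimal sketch of the idea, assuming an existing SparkContext `sc`; `cache()` is simply shorthand for `persist(StorageLevel.MEMORY_ONLY)`:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)  // assumes an existing SparkContext `sc`
  .map(x => x * 2)

rdd.persist(StorageLevel.MEMORY_ONLY)  // cache() is equivalent to this level
println(rdd.count())  // first action computes the partitions and caches them
println(rdd.sum())    // reuses the cached partitions instead of recomputing
rdd.unpersist()
```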
Spark Performance Tuning I : Setting appropriate number of Executors :
Executor: An executor is a single JVM process that is launched for an application on a worker node. An executor runs tasks and keeps data...
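Executor counts and sizes are set at submit time. The flags below are the standard spark-submit options, but the values are purely illustrative placeholders to be tuned to your cluster's cores and memory (config fragment, not runnable as-is):

```shell
spark-submit \
  --master yarn \
  --num-executors 10 \     # illustrative: total executors for the app
  --executor-cores 5 \     # illustrative: cores (concurrent tasks) per executor
  --executor-memory 16G \  # illustrative: heap per executor
  my_app.jar               # placeholder application jar
```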
Deployment Modes in Spark
Deployment mode tells where the driver program will be running. Note: the Spark driver is the program that declares the transformations...
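The mode is chosen with the standard `--deploy-mode` flag at submit time (paths and jar names below are placeholders; config fragment, not runnable as-is):

```shell
# client mode (default): the driver runs on the machine that submits the job
spark-submit --master yarn --deploy-mode client my_app.jar

# cluster mode: the driver runs inside the cluster, on one of its nodes
spark-submit --master yarn --deploy-mode cluster my_app.jar
```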
RDD Actions with examples
A transformation creates an RDD from other RDD(s), but the result is not computed until we trigger an action. When we trigger an action,...
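A small sketch of the laziness, assuming an existing SparkContext `sc`: the `map` below does nothing until an action such as `count()`, `collect()`, or `reduce()` is called:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4))  // assumes an existing SparkContext `sc`

val doubled = rdd.map(_ * 2)  // transformation: nothing is computed yet

// Actions trigger the actual computation:
println(doubled.count())           // 4
println(doubled.collect().toList)  // List(2, 4, 6, 8)
println(doubled.reduce(_ + _))     // 20
```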
Transformations - Other important ones
So far we have studied two frequently used transformations, map() and flatMap(); below are some other important transformations that one...
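For a flavour of some of these other transformations, a minimal sketch assuming an existing SparkContext `sc`:

```scala
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4))  // assumes an existing SparkContext `sc`

println(nums.filter(_ % 2 == 0).collect().toList)    // List(2, 2, 4)
println(nums.distinct().collect().sorted.toList)     // List(1, 2, 3, 4)
println(nums.union(sc.parallelize(Seq(5))).count())  // 6
```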
RDD Transformations - map() v/s flatMap()
We have already studied transformations and their basics. Here in this section we will study two very important and frequently...
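The core difference in one sketch, assuming an existing SparkContext `sc`: map() emits exactly one output element per input, while flatMap() can emit zero or more and flattens the result:

```scala
val lines = sc.parallelize(Seq("hello world", "hi"))  // assumes an existing SparkContext `sc`

// map: one output element per input element (here, an RDD of arrays)
val mapped = lines.map(_.split(" "))    // RDD[Array[String]] with 2 elements

// flatMap: each input may produce several outputs, flattened into one RDD
val flat = lines.flatMap(_.split(" "))  // RDD[String]: "hello", "world", "hi"

println(mapped.count())  // 2
println(flat.count())    // 3
```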
Partitions - Internals of Spark
Single liner - A partition is the unit of achieving parallelism in Spark. When we apply a transformation on an RDD, the transformation is...
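A quick way to see partitions in action, assuming an existing SparkContext `sc` (Spark splits a range evenly across the requested slices):

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4)  // assumes an existing SparkContext `sc`

println(rdd.getNumPartitions)  // 4

// Each partition is processed by one task; count the elements per partition
val perPartition = rdd.mapPartitions(it => Iterator(it.size)).collect()
println(perPartition.toList)  // List(25, 25, 25, 25)
```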
Creating RDD - Basics
There are two ways by which you can create an RDD. By using parallelize(): you can create an RDD from an existing collection using parallelize()...
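Both ways in one sketch, assuming an existing SparkContext `sc` (the HDFS path is a placeholder, not a real dataset):

```scala
// (1) From an existing in-memory collection
val fromCollection = sc.parallelize(Seq("a", "b", "c"))  // assumes SparkContext `sc`
println(fromCollection.count())  // 3

// (2) From an external dataset, e.g. a text file (placeholder path)
val fromFile = sc.textFile("hdfs:///path/to/input.txt")  // one element per line
```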
Spark Context/Spark Session : 'The entry point to Spark programming'
Before diving deep into SparkContext and SparkSession, understand the basic difference between the two - SparkSession is a unified entry...
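Since Spark 2.0, the unified entry point is built like this (app name and master URL below are placeholders); the older SparkContext remains reachable through the session:

```scala
import org.apache.spark.sql.SparkSession

// Unified entry point since Spark 2.0
val spark = SparkSession.builder()
  .appName("my-app")   // placeholder application name
  .master("local[*]")  // placeholder master URL
  .getOrCreate()

// The underlying SparkContext is still available via the session
val sc = spark.sparkContext
println(sc.parallelize(1 to 3).count())  // 3
spark.stop()
```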
RDD - Fundamental Data Structure of Spark
A Resilient Distributed Dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated...
MapReduce V/S Spark
Those who do not have any exposure to MapReduce can skip this article and start from the upcoming one. (1) Performance in terms of execution...
Why Spark ?
Apache Spark is an open-source, general-purpose distributed computing engine used for processing and analyzing large amounts of data....