
RDD - Fundamental Data Structure of Spark

  • kumarnitinkarn
  • Sep 10, 2019
  • 2 min read

A Resilient Distributed Dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.


RDD stands for:


Resilient means fault-tolerant: with the help of the RDD lineage graph (DAG), Spark is able to recompute missing or damaged partitions.

Distributed, as the data resides on multiple nodes/systems.

Dataset represents the bytes of data you work with. The data can come from any external source, such as a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.


So far, we have understood that RDD is nothing but a partitioned data structure that resides on multiple nodes. Now, let's understand what operations we can perform on RDDs.


Operations on RDDs:

(1) Transformations : Transformations create a new dataset from an existing one.

Ex: Map is a transformation that passes each element of a dataset through a function and returns a new RDD representing the result.
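As an illustration, here is a minimal sketch in plain Python (standing in for Spark's API, since no live cluster is assumed here); in PySpark the equivalent would be something like `sc.parallelize(data).map(lambda x: x * 2)`:

```python
# Illustrative sketch in plain Python, not PySpark itself.
data = [1, 2, 3, 4]

# map is a transformation: it describes how to derive a new dataset
# from the existing one, element by element.
doubled = list(map(lambda x: x * 2, data))

print(doubled)  # [2, 4, 6, 8]
```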


(2) Actions : Actions return a value to the driver program after running a computation on the dataset.

Ex: Reduce is an action that aggregates the elements of an RDD using some function and returns the result to the driver program.
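Again as a minimal plain-Python sketch (in PySpark this would read roughly `rdd.reduce(lambda a, b: a + b)`):

```python
from functools import reduce

# Illustrative sketch in plain Python, not PySpark itself.
data = [1, 2, 3, 4]

# reduce is an action: it aggregates the elements with the given
# function and returns a single value to the caller (the driver).
total = reduce(lambda a, b: a + b, data)

print(total)  # 10
```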


NOTE: Each transformed RDD may be recomputed each time you run an action on it. How to avoid this will be covered in upcoming posts explaining optimization techniques in Spark.


Terminology that will help you understand this post and upcoming posts:


Driver Program: The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master; it runs in its own Java process. It is responsible for converting the user program into units of physical execution called tasks.


RDD lineage graph, aka 'Logical Execution Plan': When we create a new RDD from an existing Spark RDD, the new RDD carries a pointer to its parent RDD. In this way, all the dependencies between RDDs are recorded in a graph, rather than the actual data. This is what we call the lineage graph.

It is built as a result of applying transformations to the RDD.
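A loose analogy using plain Python generators (not Spark itself): each generator remembers how to derive its elements from its parent rather than holding data, much as each RDD records a pointer to its parent in the lineage graph, and nothing is computed until the chain is consumed:

```python
# Plain-Python analogy for a lineage graph, not Spark itself.
nums = range(5)                      # parent dataset: 0..4
doubled = (x * 2 for x in nums)      # child records how to derive from parent
big = (x for x in doubled if x > 4)  # grandchild records its own rule

# Like RDD transformations, the generators above hold no data yet;
# evaluation happens only when something consumes the chain (an "action"):
print(list(big))  # [6, 8]
```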


That is enough detail for a brief insight into RDDs. Let's catch up on RDD creation in upcoming posts.


Happy Learning !!


 
 
 



©2019 by Spark knack.