Creating RDD - Basics
- kumarnitinkarn
- Sep 23, 2019
- 1 min read
There are two ways to create an RDD:
By using parallelize() : You can create an RDD from an existing collection using the parallelize() method. This approach is generally used for learning or POC purposes.
For practice, you can run the code below in the pyspark shell. To open the pyspark shell, you can either use the Cloudera QuickStart VM or configure a pseudo-distributed cluster on your Unix machine.
Syntax:
data = sc.parallelize([('John', 12), ('Jack', 32), ('Sana', 81), ('Nahad', 34), ('Farhad', 23)])
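As a quick sanity check, you can run an action on the RDD created above; this is a minimal sketch, assuming you are in the pyspark shell where sc (the SparkContext) is already defined:

print(data.collect())  # pulls all elements back to the driver: [('John', 12), ('Jack', 32), ...]
print(data.count())    # number of elements: 5

collect() should only be used on small RDDs like this one, since it brings the entire dataset to the driver.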
By using textFile() : You can create an RDD from an external file by using the textFile() method.
Syntax:
rdd_from_file = sc.textFile('/Users/user_name/Documents/PySpark_Data/test.txt')
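To confirm the file was read, you can again run a small action on the RDD created above; this is a minimal sketch, where the path is the placeholder from the syntax line and test.txt is assumed to contain a few lines of text:

print(rdd_from_file.take(3))  # first three lines of the file, one RDD element per line
print(rdd_from_file.count())  # total number of lines in the file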
Additionally, both parallelize() and textFile() take an optional second parameter: the number of partitions.
Ex:
data = sc.parallelize([('John', 12), ('Jack', 32), ('Sana', 81), ('Nahad', 34), ('Farhad', 23)], 4)
Here, 4 is the number of partitions.
or,
rdd_from_file = sc.textFile('/Users/user_name/Documents/PySpark_Data/test.txt', 5)
Here, 5 is the number of partitions (for textFile(), this is actually the minimum number of partitions, so Spark may create more).
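You can verify the partition count of both RDDs with getNumPartitions(); this is a minimal sketch reusing the sample data and placeholder path from the examples above:

print(data.getNumPartitions())           # 4
print(rdd_from_file.getNumPartitions())  # at least 5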
We will study partitions in more detail in upcoming posts.
Happy Learning!!