Creating RDD - Basics
- kumarnitinkarn
- Sep 23, 2019
- 1 min read
There are two ways to create an RDD:
By using parallelize() : You can create an RDD from an existing collection using the parallelize() method. This approach is generally used for learning or POC purposes.
For practice, you can run the code below in the pyspark shell. To open the pyspark shell, you can either use the Cloudera QuickStart VM or configure a pseudo-distributed cluster on your Unix machine.
Syntax:
data = sc.parallelize([('John', 12), ('Jack', 32), ('Sana', 81), ('Nahad', 34), ('Farhad', 23)])
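As a quick sanity check, you can run an action on the RDD created above; this is a minimal sketch, assuming you are in the pyspark shell where sc (the SparkContext) is already defined:

print(data.collect())  # pulls all elements back to the driver: [('John', 12), ('Jack', 32), ...]
print(data.count())    # number of elements: 5

collect() should only be used on small RDDs like this one, since it brings the entire dataset to the driver.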
By using textFile() : You can create an RDD from an external file by using the textFile() method.
Syntax:
rdd_from_file = sc.textFile('/Users/user_name/Documents/PySpark_Data/test.txt')
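To confirm the file was read, you can again run a small action on the RDD created above; this is a minimal sketch, where the path is the placeholder from the syntax line and test.txt is assumed to contain a few lines of text:

print(rdd_from_file.take(3))  # first three lines of the file, one RDD element per line
print(rdd_from_file.count())  # total number of lines in the file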
Additionally, both parallelize() and textFile() take an optional second parameter: the number of partitions.
Ex:
data = sc.parallelize([('John', 12), ('Jack', 32), ('Sana', 81), ('Nahad', 34), ('Farhad', 23)], 4)
Here, 4 is the number of partitions.
or,
rdd_from_file = sc.textFile('/Users/user_name/Documents/PySpark_Data/test.txt', 5)
Here, 5 is the number of partitions (for textFile(), this is actually the minimum number of partitions, so Spark may create more).
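You can verify the partition count of both RDDs with getNumPartitions(); this is a minimal sketch reusing the sample data and placeholder path from the examples above:

print(data.getNumPartitions())           # 4
print(rdd_from_file.getNumPartitions())  # at least 5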
We will study partitions in more detail in upcoming posts.
Happy Learning!!