
Creating RDD - Basics

  • kumarnitinkarn
  • Sep 23, 2019
  • 1 min read

There are two ways to create an RDD:


By using parallelize() : You can create an RDD from an existing collection using the parallelize() method. This approach is generally used for learning or POC purposes.


For practice, you can start by running the code below in the pyspark shell.

To open the pyspark shell, you can either use the Cloudera QuickStart VM or configure a pseudo-distributed cluster on your Unix machine.


Syntax :

data = sc.parallelize([('John', 12), ('Jack', 32), ('Sana', 81), ('Nahad', 34), ('Farhad', 23)])
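
Once the RDD is created, you can verify it with a simple action. A minimal check (note that collect() brings every element back to the driver, so use it only on small datasets like this one):

data.count()    # returns 5, the number of elements in the RDD
data.collect()  # returns the full list of (name, number) tuples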

By using textFile() : You can create an RDD from an external file using the textFile() method.


Syntax :

rdd_from_file = sc.textFile('/Users/user_name/Documents/PySpark_Data/test.txt')
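
Each element of the resulting RDD is one line of the file, as a string. As a quick sanity check (assuming test.txt actually exists at the path above), you can run:

rdd_from_file.count()   # total number of lines in the file
rdd_from_file.take(2)   # returns the first 2 lines as a list of strings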

Additionally, both parallelize() and textFile() take an optional second parameter, which is the number of partitions.


Ex :

data = sc.parallelize([('John', 12), ('Jack', 32), ('Sana', 81), ('Nahad', 34), ('Farhad', 23)], 4)


Here, 4 is the number of partitions.

or,


rdd_from_file = sc.textFile('/Users/user_name/Documents/PySpark_Data/test.txt', 5)

Here, 5 is the number of partitions.
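
You can confirm the partition count of any RDD with getNumPartitions(). For example:

data.getNumPartitions()           # returns 4
rdd_from_file.getNumPartitions()  # returns at least 5; textFile() treats this argument as a minimum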

We will study partitions in more detail in upcoming posts.


Happy Learning!!



