Partitions - Internals of Spark
- kumarnitinkarn
- Sep 23, 2019
- 1 min read
Single liner - A partition is the unit of parallelism in Spark.
When we apply a transformation on an RDD, the transformation is applied to each of its partitions. Spark spawns one task per partition, and each task runs inside an executor JVM. Each stage therefore contains as many tasks as the RDD has partitions, and each task performs the transformations (map, filter etc.) pipelined in that stage.
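To make the task-per-partition relationship concrete, here is a minimal sketch in spark-shell style. The local master URL and the partition count of 8 are illustrative assumptions, not values from the post:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local setup, just for illustration.
val spark = SparkSession.builder()
  .appName("partition-tasks-demo")
  .master("local[4]")
  .getOrCreate()
val sc = spark.sparkContext

// An RDD split into 8 partitions (an arbitrary choice for the example).
val rdd = sc.parallelize(1 to 1000, 8)

// map and filter are narrow transformations: they are pipelined and applied
// partition by partition within one stage, so that stage runs 8 tasks,
// one per partition.
val doubled = rdd.map(_ * 2).filter(_ % 3 == 0)

println(doubled.getNumPartitions)   // 8
println(doubled.count())            // triggers a job: 8 tasks in the stage
```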
Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the
number of partitions automatically based on your cluster. However, you can also set it manually, in either of two ways (sketched in code right after this list):
(1) Passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
(2) The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file
(blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
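Here is a short sketch of both options, reusing the sc from the sketch above; the HDFS path and the partition counts are illustrative assumptions:

```scala
val data = 1 to 100000

// (1) Pass the desired number of partitions as the second argument to parallelize.
val rdd1 = sc.parallelize(data, 10)
println(rdd1.getNumPartitions)      // 10

// (2) Pass a minimum number of partitions as the second argument to textFile.
// Spark creates at least one partition per HDFS block (128 MB by default),
// so you can ask for more partitions than blocks, but never fewer.
val rdd2 = sc.textFile("hdfs:///data/big_file.txt", 100)
println(rdd2.getNumPartitions)      // at least 100
```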
How to decide on the number of partitions:
Having too few partitions results in less concurrency, and it also increases memory pressure
for transformations that involve a shuffle.
However, having too many partitions can also have a negative impact, as too much time is spent scheduling and launching many small tasks.
The recommended number of partitions is around 3 or 4 times the number of CPU cores in the cluster, so that the work gets distributed more evenly among the cores.
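As a rough sketch of this rule of thumb (the multiplier of 3 and the deliberately low initial partition count are illustrative assumptions):

```scala
val cores = sc.defaultParallelism          // typically the total cores across executors
val targetPartitions = cores * 3           // roughly 3-4x the number of cores

val skewed = sc.parallelize(1 to 1000000, 2)   // too few partitions for a big cluster
val balanced = skewed.repartition(targetPartitions)
println(s"${skewed.getNumPartitions} -> ${balanced.getNumPartitions} partitions")
```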
In later posts, we will study how and when we can change the number of partitions.
Happy learning !!