
Spark Performance Tuning II : persist() and cache()

  • kumarnitinkarn
  • Jan 27, 2020
  • 3 min read

Updated: Jan 31, 2020

When we persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset.


Why we should persist an RDD/DF :


Most RDD operations are lazy: Spark simply builds a DAG of transformations, and only when you call an action does it execute that series of operations to produce the required results.

RDD.cache() is also a lazy operation. The data is actually persisted in memory only when you execute an action for the first time; subsequent actions, if any, then reuse the cached data.


When should we persist an RDD :


When we use an RDD/DF more than once, persisting it in memory prevents that RDD/DF from being computed multiple times and hence reduces the overall execution time. Future actions on it will be much faster.


Sample Code :


# Create a DataFrame for testing

df = sqlContext.createDataFrame([(20, 'test')], ["code", "name"])


# Cache the DataFrame (lazy: nothing is stored in memory yet)

df.cache()
# returns: DataFrame[code: bigint, name: string]


# Run an action on the cached DataFrame so it is actually materialized

df.count()
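

To confirm that the DataFrame has been registered for caching, we can check its is_cached flag and (on Spark 2.1+) its storageLevel property:


# True once cache()/persist() has been called on the DataFrame

print(df.is_cached)

# The storage level Spark will use for this DataFrame (Spark 2.1+)

print(df.storageLevel)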


There are two ways in which you can persist an RDD/DF:


(1) By calling cache()

(2) By calling persist()


The only difference is : the cache() method is shorthand for using the default storage level. For RDDs this is StorageLevel.MEMORY_ONLY (for DataFrames, recent Spark versions default to MEMORY_AND_DISK).

With persist(), on the other hand, we can also specify the storage level at which we want our RDD/DF to be persisted.
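

As a minimal sketch (reusing the df from the sample code above), persist() accepts a pyspark StorageLevel constant:


from pyspark import StorageLevel

# Release the earlier cache() so a different level can be assigned
df.unpersist()

# Persist with an explicit storage level instead of the default
df.persist(StorageLevel.MEMORY_AND_DISK)

# An action is still needed before anything is actually stored
df.count()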


Different Storage Levels available to persist() :


MEMORY_ONLY : Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.


MEMORY_AND_DISK : Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.


MEMORY_ONLY_SER : Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.


MEMORY_AND_DISK_SER : Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.


DISK_ONLY : Store the RDD partitions only on disk.


MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc : Same as the levels above, but replicate each partition on two cluster nodes.


OFF_HEAP : Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
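

As the OFF_HEAP description says, this level only works when off-heap memory is enabled. A minimal sketch of the required configuration, set when the application starts (the app name, DataFrame name, and 4g size below are just illustrative placeholders):


from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Off-heap storage must be enabled explicitly; the size is illustrative
spark = SparkSession.builder \
    .appName("offheap-demo") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "4g") \
    .getOrCreate()

df_offheap = spark.createDataFrame([(20, 'test')], ["code", "name"])
df_offheap.persist(StorageLevel.OFF_HEAP)
df_offheap.count()   # the action materializes the data off-heap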


Which Storage level to use in which case :


If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option.


If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.


Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.


Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
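

For example, a replicated level is requested in the same way as any other level; the df_replicated DataFrame below is just a placeholder:


from pyspark import StorageLevel

# Keep two replicas of every cached partition for faster fault recovery
df_replicated = sqlContext.createDataFrame([(30, 'demo')], ["code", "name"])
df_replicated.persist(StorageLevel.MEMORY_AND_DISK_2)
df_replicated.count()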


unpersist() :


Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
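

With the DataFrame from the sample code above, that is simply:


# Manually remove the cached data for this DataFrame
# (pass blocking=True to wait until all blocks are removed)

df.unpersist()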



Happy Learning !!


 
 
 
