
Spark Performance Tuning III : Broadcast variables and Accumulators

  • kumarnitinkarn
  • Jan 31, 2020
  • 2 min read

Broadcast Variables


Let's first discuss how executors work when we do not use a broadcast variable:


When we pass a function to a Spark operation, it executes on remote nodes in the cluster. Each executor works on its own copy of every variable referenced by that function: the variables are serialized and copied to each machine, and updates made to those copies on the executors are never propagated back to the driver program.

In other words, shipping variables to each executor for every transformation and action can cause network overhead.
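To make this concrete, here is a minimal sketch of what happens without a broadcast variable (the country_codes lookup table is a hypothetical example, not from the post): the dictionary is captured in the lambda's closure, so Spark serializes it and ships a copy with the tasks of every stage that references it.


>>> country_codes = {"IN": "India", "US": "United States", "DE": "Germany"}  # imagine this is large

>>> rdd = sc.parallelize(["IN", "US", "DE", "IN"])

>>> # each task receives its own serialized copy of country_codes inside the closure

>>> rdd.map(lambda code: country_codes.get(code, "Unknown")).collect()

['India', 'United States', 'Germany', 'India']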


Apache Spark provides two types of shared variables, namely broadcast variables and accumulators.


Broadcast variable:

If we have a large dataset, instead of shipping a copy of it with every task, we can use a broadcast variable, which is copied to each node once and shared by every task running on that node. Broadcast variables are an efficient way to make a large, read-only dataset available on every node.

First, we create a broadcast variable on the driver program using SparkContext.broadcast; Spark then distributes it to all nodes.

The value method is used to access the shared data. A broadcast variable is worthwhile only when tasks across multiple stages need the same data.


>>> broadcastVar = sc.broadcast([1, 2, 3])

<pyspark.broadcast.Broadcast object at 0x102789f10>


>>> broadcastVar.value

[1, 2, 3]
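Building on the same hypothetical lookup table from above, here is a sketch of how a broadcast variable is typically used: the dictionary is shipped to each node once, and every task on that node reads the shared copy through value instead of dragging the data along in its closure.


>>> bcCodes = sc.broadcast({"IN": "India", "US": "United States", "DE": "Germany"})

>>> rdd = sc.parallelize([("IN", 10), ("US", 20), ("DE", 30)])

>>> rdd.map(lambda kv: (bcCodes.value.get(kv[0], "Unknown"), kv[1])).collect()

[('India', 10), ('United States', 20), ('Germany', 30)]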


Things to remember while using Broadcast variables:


Once we have broadcast a value to the nodes, we should not modify it afterwards; this ensures that every node has exactly the same copy of the data.

If the value were modified, the new copy might be shipped to some node later, giving inconsistent and unexpected results.
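As a sketch of this caution (with a hypothetical exchange-rate table), the safe pattern is to never mutate the broadcast data in place; if the data changes, release the old broadcast and create a new one:


>>> rates = sc.broadcast({"USD": 1.0, "EUR": 1.1})

>>> # don't mutate rates.value in place -- nodes that already fetched it may keep the old copy

>>> rates.unpersist()                                 # remove cached copies from the executors

>>> rates = sc.broadcast({"USD": 1.0, "EUR": 1.2})    # re-broadcast the updated data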


Accumulators:


Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.


Let's try a simple example with accumulators:


An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on a cluster can then add to it using the add method or the += operator. However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.


>>> accu = sc.accumulator(0)

>>> accu

Accumulator<id=0, value=0>


>>> sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: accu.add(x))

INFO SparkContext: Tasks finished in 0.23154 s


>>> accu.value

15
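As a slightly more practical sketch (the records below are made up), this is the counter pattern mentioned earlier: executors only add to the accumulator, and only the driver reads the final value. Note that if an accumulator is updated inside a transformation, the update may be applied more than once when tasks are re-executed, so counters are most reliable when updated inside actions such as foreach.


>>> bad_records = sc.accumulator(0)

>>> def parse(line):
...     try:
...         int(line)
...     except ValueError:
...         bad_records.add(1)   # executors can add, but cannot read the value
...

>>> sc.parallelize(["1", "2", "oops", "4"]).foreach(parse)

>>> bad_records.value    # only the driver can read the accumulator

1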


Keep Learning !!


 
 
 
