kumarnitinkarn

Spark Performance Tuning III : Broadcast Variables and Accumulators
Broadcast Variables: Let's first discuss how an executor works when we do not use a broadcast variable: When we...
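The teaser is cut off, but the cost it hints at can be sketched in plain Python (an analogy only, not the Spark API): without a broadcast variable, lookup data rides along inside every task's serialized closure, while a broadcast variable ships one read-only copy per executor that all tasks on that node reuse.

```python
import pickle

lookup = {"a": 1, "b": 2, "c": 3}
tasks = ["a", "b", "c", "a", "b", "c"]

# Without a broadcast variable: the lookup table is captured in each
# task's closure, so it is serialized once per task.
bytes_without = sum(len(pickle.dumps(lookup)) for _ in tasks)

# With a broadcast variable: the table is serialized once per executor
# (assume 2 executors in this sketch) and every task reads the same
# local, read-only copy.
num_executors = 2  # hypothetical cluster size for illustration
bytes_with = len(pickle.dumps(lookup)) * num_executors

assert bytes_with < bytes_without
```

The gap widens with more tasks and larger lookup data, which is exactly when broadcasting pays off.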
SparkContext v/s SparkSession - Deep Dive
Prior to Spark 2.0, SparkContext was the entry point of any Spark application and was used to access all Spark features. The Spark driver...
Spark Performance Tuning II : persist() and cache()
When we persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset....
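The reuse the teaser describes can be illustrated with plain-Python memoization (an analogy for persist()/cache(), not the Spark API): once a result is computed and kept, later "actions" read it back instead of recomputing.

```python
from functools import lru_cache

compute_count = 0

@lru_cache(maxsize=None)
def expensive(x):
    global compute_count
    compute_count += 1  # track how often the real computation runs
    return x * x

first = expensive(4)   # computed and stored
second = expensive(4)  # served from the cache, no recomputation

assert first == second == 16
assert compute_count == 1
```

Without the cache, every action over the same lineage would repeat the computation, just as an unpersisted RDD is rebuilt for each action.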
Spark Performance Tuning I : Setting the Appropriate Number of Executors
Executor: An executor is a single JVM process which is launched for an application on a worker node. The executor runs tasks and keeps data...
Deployment Modes in Spark
Deployment mode tells where the driver program will be running. Note: the Spark driver is the program that declares the transformations...
RDD Actions with Examples
A transformation creates an RDD from other RDD(s), but the result is not computed until we trigger an action. When we trigger an action,...
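The lazy-evaluation behaviour the teaser describes can be sketched with a Python generator (an analogy, not the Spark API): building the pipeline does no work, just like chaining transformations; consuming it triggers the computation, just like calling an action.

```python
calls = []

def times_two(x):
    calls.append(x)  # record that the function actually ran
    return x * 2

data = [1, 2, 3]

# Building the pipeline is like applying a transformation: nothing runs yet.
pipeline = (times_two(x) for x in data)
assert calls == []  # still lazy, like an un-actioned RDD lineage

# Consuming the generator is like calling an action: now the work happens.
result = sum(pipeline)
assert calls == [1, 2, 3]
assert result == 12
```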
Transformations - Other Important Ones
So far we have studied two frequently used transformations, map() and flatMap(); below are some other important transformations that one...
RDD Transformations - map() v/s flatMap()
We have already studied transformations and their basics. Here in this section we will study two very important and frequently...
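The distinction the teaser sets up can be previewed with plain Python lists (an analogy, not the Spark API): map produces exactly one output element per input, while flatMap applies the same per-element function and then flattens the resulting sequences.

```python
from itertools import chain

lines = ["hello world", "hello spark"]

# map: one output element per input element (here, a list of lists)
mapped = [line.split() for line in lines]

# flatMap: same per-element function, but the nested results are flattened
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

assert mapped == [["hello", "world"], ["hello", "spark"]]
assert flat_mapped == ["hello", "world", "hello", "spark"]
```

This mirrors the classic word-count step, where flatMap turns lines into a single stream of words.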
Partitions - Internals of Spark
One-liner: a partition is the unit of achieving parallelism in Spark. When we apply a transformation on an RDD, the transformation is...
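The one-liner can be made concrete with a small sketch in plain Python (an illustration of the idea, not how Spark actually assigns records): splitting a dataset into partitions yields independent chunks that separate tasks could process in parallel.

```python
def partition(data, n):
    # round-robin split into n partitions, similar in spirit to
    # how an RDD's elements are spread across partitions
    parts = [[] for _ in range(n)]
    for i, x in enumerate(data):
        parts[i % n].append(x)
    return parts

parts = partition(list(range(10)), 3)
# each partition can be handed to a separate task on a separate core
assert parts == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```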
Creating RDD - Basics
There are two ways by which you can create an RDD: By using parallelize(): You can create an RDD from an existing collection using parallelize()...
Spark Context/Spark Session : 'The Entry Point to Spark Programming'
Before diving deep into Spark Context and Spark Session, understand the basic difference between the two - Spark Session is a unified entry...
RDD - Fundamental Data Structure of Spark
A Resilient Distributed Dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated...
MapReduce V/S Spark
Those who do not have any exposure to MapReduce can skip this article and start from the upcoming one. (1) Performance in terms of execution...
Why Spark?
Apache Spark is an open-source, general-purpose distributed computing engine used for processing and analyzing large amounts of data....