RDD’s or resilient distributed datasets are the fundamental data abstraction
in Apache Spark. An RDD amounts to what is a distributed dataset in memory.
It is distributed or partitioned among various nodes in a Spark cluster.
The official definition of an RDD in the official documentation is as follows:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
In the seminal paper on RDDs by Matei Zaharia, an RDD is defined as follows:
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
The Resilient part is due to the fact that it is fault tolerant and hence if a
partition of the dataset is lost due to a node going done it can be recovered in a timely manner. The recoverability comes from the fact that its lineage graph is maintained by the driver program and hence the partition can be recomputed.
The lineage graph consists of a series of transformations and actions to be
computed on a partition of data and this is stored on the driver program.
The RDD in effect is a set of instructions which are lazily evaluated.
The evaluation only occurs when an action is encountered.
Here is an example of what an RDD’s lineage graph looks like:
res6: String =
(2) MapPartitionsRDD at map at <console>:35 
| MapPartitionsRDD at filter at <console>:33 
| MapPartitionsRDD at map at <console>:31 
| MapPartitionsRDD at map at <console>:31 
| MapPartitionsRDD at textFile at <console>:29 
The Distributed part stems from the fact that the data is distributed among worker nodes of a cluster. The driver program sends the RDD along with which partition of the data that should be computed on that particular cluster. The program on the worker node that is responsible for executing the set of instructions encapsulated in the RDD is called the Executor. The exact call is as follows, in
def compute(split: Partition, context: TaskContext): Iterator[T]
The Dataset expresses the fact that we are in fact processing a collection of
data albeit one that will be partitioned.
Again, taking a look at the source code, in
we see :
* Implemented by subclasses to return the set of partitions in this RDD. This method will only
* be called once, so it is safe to implement a time-consuming
computation in it.
protected def getPartitions: Array[Partition]
Types of Operations
RDDs support 2 types of operations: Transformations and Actions.
Transformations convert an RDD from one type to another.
They are lazily evaluated, meaning that they’re only executed when data is ready to be output.
Actions trigger the execution of the chain of operations that result in data being returned. They are necessary for transformations to be evaluated. Without an action, an RDD is just a chain of transformations that are yet to be evaluated.
Contents of an RDD
An RDD is characterized by the of the following 5 properties:
- A list of partitions that comprise the dataset.
- A function to perform the computation for each partition.
- A list of dependencies on other RDDs i.e. parent RDDs.
The parent RDD is the initial RDD on which a transformation is
- A Partitioner for key-value/Pair RDDs (Pair RDDs are defined later).
- A list of preferred locations/hosts to load the various partitions into
which the data has been split.
Where does the RDD live ?
The RDD lives in the Spark Context on the driver program which runs on the master node in a cluster setup.
It is uniquely identified by an id :
scala> val exampleRDD = sc.parallelize(List(1,2,3,4,5))
exampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:27
res34: Int = 12
scala> val example2RDD = sc.parallelize(List(1,2,3,4,5))
example2RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:27
res36: Int = 13
scala> val example3RDD=exampleRDD
example3RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:27
res37: Int = 12
Ways to create an RDD
- Parallelize an existing collection e.g.
primesRDD = sc.parallelize([2,3,5,7,11,13,17])
- Read data from an external distributed source e.g. HDFS, S3, Cassandra
gettysburgRDD = sc.textFile("hdfs://user/dataphantik/gettysburg.txt"
- From an existing RDD – transforming an existing RDD into new RDD
errorsRDD = baseRDD.filter(lambda line : line contains 'Error')
- Via Spark Streaming
Types of RDDs
The base class of all RDDs, org.apache.spark.rdd.RDD is an abstract class which all base RDDs inherit from.
The base RDD classes are:
- HadoopRDD – Provides core functionality for reading data stored in Hadoop (e.g
files in HDFS, sources in HBase, or S3), using the older MapReduce API (`org.
- MapPartitionsRDD – applies the provided function to every partition of the
parent RDD. Normally returned when an RDD is created from a file via sc.textFile(..)
- ParallelCollectionRDD – RDD representing a collection of elements. It containe
s numSlices Partitions and locationPrefs, which is a Map. It is obtained from call to sc.parallelize(..) on a collection in memory.
- PipedRDD – An RDD that pipes the contents of each parent partition through an e
xternal command (printing them one per line) and returns the output as a collect
ion of strings.
There are also depictions of
- PairRDD – an RDD of key-value pairs. It can be a ParallelCollectionRDD containin
g key-value pairs. There is no concrete PairRDD class, since it is an abstraction, however a class, PairRDDFunctions provides a set of transformations that can
be performed on Pair RDDs.