Spark Expositions: Understanding RDDs


RDDs, or Resilient Distributed Datasets, are the fundamental data abstraction
in Apache Spark. An RDD is, in essence, a distributed in-memory dataset:
it is partitioned among the various nodes of a Spark cluster.
The official documentation defines an RDD as follows:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

In the seminal paper on RDDs by Matei Zaharia, an RDD is defined as follows:

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

The Resilient part reflects the fact that an RDD is fault tolerant: if a
partition of the dataset is lost because a node goes down, it can be recovered in a timely manner. The recoverability comes from the fact that the RDD's lineage graph is maintained by the driver program, so the lost partition can simply be recomputed.

The lineage graph consists of the series of transformations and actions to be
computed on a partition of data, and it is stored on the driver program.
An RDD is, in effect, a set of instructions that are lazily evaluated:
evaluation only occurs when an action is encountered.
Here is an example of what an RDD’s lineage graph looks like:

scala> idBudgetRDD.toDebugString
res6: String = 
(2) MapPartitionsRDD[8] at map at <console>:35 []
 | MapPartitionsRDD[7] at filter at <console>:33 []
 | MapPartitionsRDD[3] at map at <console>:31 []
 | MapPartitionsRDD[2] at map at <console>:31 []
 | MapPartitionsRDD[1] at textFile at <console>:29 []

The Distributed part stems from the fact that the data is distributed among the worker nodes of a cluster. The driver program sends each worker node the RDD along with the partition of the data that should be computed on that node. The process on the worker node that is responsible for executing the set of instructions encapsulated in the RDD is called the Executor. The method that computes a partition is declared as follows in the RDD class:


 def compute(split: Partition, context: TaskContext): Iterator[T]

The Dataset part expresses the fact that we are processing a collection of
data, albeit one that is partitioned.

Again, taking a look at the source code, we see:

 /**
  * Implemented by subclasses to return the set of partitions in this RDD. This method will only
  * be called once, so it is safe to implement a time-consuming computation in it.
  */
 protected def getPartitions: Array[Partition]

Types of Operations

RDDs support two types of operations: Transformations and Actions.

Transformations convert one RDD into another.
They are lazily evaluated, meaning they are only executed when an action requires their output.

Actions trigger the execution of the chain of operations that result in data being returned. They are necessary for transformations to be evaluated. Without an action, an RDD is just a chain of transformations that are yet to be evaluated.

Contents of an RDD

An RDD is characterized by the following five properties:

  1. A list of partitions that comprise the dataset.
  2. A function to perform the computation for each partition.
  3. A list of dependencies on other RDDs, i.e. parent RDDs.
    A parent RDD is the RDD on which a transformation was applied to
    produce this one.
  4. A Partitioner for key-value/Pair RDDs (Pair RDDs are defined later).
  5. A list of preferred locations/hosts on which to load each of the
    partitions into which the data has been split.

Where does the RDD live?

The RDD lives in the SparkContext on the driver program, which runs on the master node in a cluster setup.

It is uniquely identified by an id:

scala> val exampleRDD = sc.parallelize(List(1,2,3,4,5))
exampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:27

res34: Int = 12

scala> val example2RDD = sc.parallelize(List(1,2,3,4,5))
example2RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:27

res36: Int = 13

scala> val example3RDD = exampleRDD
example3RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:27

res37: Int = 12

Note that example3RDD shares the id of exampleRDD, since both names refer to the same RDD.


Ways to create an RDD

  1. Parallelize an existing collection e.g.
    primesRDD = sc.parallelize([2,3,5,7,11,13,17])
  2. Read data from an external distributed source e.g. HDFS, S3, Cassandra

     gettysburgRDD = sc.textFile("hdfs://user/dataphantik/gettysburg.txt")
  3. From an existing RDD – transforming an existing RDD into a new RDD
    errorsRDD = baseRDD.filter(lambda line: 'Error' in line)
  4. Via Spark Streaming

Types of RDDs

The base class of all RDDs, org.apache.spark.rdd.RDD, is an abstract class from which all RDDs inherit.

The core RDD subclasses include:

  • HadoopRDD – provides core functionality for reading data stored in Hadoop
    (e.g. files in HDFS, sources in HBase, or S3), using the older MapReduce API
    (org.apache.hadoop.mapred).
  • MapPartitionsRDD – applies the provided function to every partition of the
    parent RDD. Normally returned when an RDD is created from a file via sc.textFile(..).
  • ParallelCollectionRDD – an RDD representing a collection of elements. It
    contains numSlices partitions and locationPrefs, which is a Map. It is
    obtained from a call to sc.parallelize(..) on an in-memory collection.
  • PipedRDD – an RDD that pipes the contents of each parent partition through
    an external command (printing them one per line) and returns the output as
    a collection of strings.

There are also abstractions such as:

  • PairRDD – an RDD of key-value pairs; it can be, for example, a
    ParallelCollectionRDD containing key-value pairs. There is no concrete
    PairRDD class, since it is an abstraction; however, the class
    PairRDDFunctions provides a set of transformations that can be performed
    on Pair RDDs.


Useful Distributed Systems Links

Distributed Systems is a constantly changing field. This page attempts to keep track of sites and blogs that are frequently updated and chock full of useful information for folks interested in keeping up with state-of-the-art technologies.


Framework Sites

Distributed Systems Concepts

Pandas Cheatsheet

Reset index for data frame based on specific column
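The snippet for this entry is missing; one plausible reading, sketched with a made-up toy frame (the column name 'a' is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})
df = df.set_index('a')    # make column 'a' the index
df = df.reset_index()     # move it back to a regular column, restoring a RangeIndex
```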


Show null columns of dataframe

df.columns[df.isnull().any()]

Sum specific column of dataframe, excluding NULLs
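The snippet for this entry is missing; a minimal sketch (the column name is made up). Series.sum skips NaN by default via skipna=True:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
total = df['a'].sum(skipna=True)   # skipna=True is also the default
```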


Loop over rows of dataframe
With iterrows (produces index, columns as a Series):

for idx, row in df.iterrows():
    print(idx, row['a'], row['col2'], row['col3'])

With itertuples (produces a tuple of the index and column values):

for irow in df.itertuples():
    print(irow)

Check if dataframe column contains string
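The snippet for this entry is missing; a minimal sketch using Series.str.contains (the column name and substring are made up):

```python
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob', 'carol']})
mask = df['name'].str.contains('bo')   # boolean Series, True where the substring occurs
found = mask.any()                     # True if any row matches
```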


Increase the width of dataframe in jupyter display

pd.options.display.max_colwidth = 100

Estimate memory usage for dataframe
By column
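The snippet for this entry is missing; a minimal sketch using DataFrame.memory_usage (the toy frame is made up; deep=True also counts the contents of object columns):

```python
import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': ['x'] * 1000})
per_column = df.memory_usage(deep=True)   # bytes per column (plus the index)
total_bytes = per_column.sum()
```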




Save numpy array to text file

np.savetxt('data.txt', A, fmt="%s")

Filter rows with null/nan values and write to CSV

df[df.isnull().any(axis=1)].to_csv("./rows_missing_last.csv", sep='|', index=False, header=True)

Quick Introduction to Apache Spark

What is Spark

Spark is a fast, general-purpose framework for cluster computing.
It is written in Scala but provides APIs in Scala, Java and Python.
The main data abstraction in Spark is the RDD, or Resilient Distributed Dataset.
With this abstraction, Spark enables data to be distributed and processed among
the many nodes of a computing cluster.
It provides both interactive and programmatic interfaces. It consists of the following components:

  • Spark Core – the foundational classes on which Spark is built. It provides the RDD.
  • Spark Streaming – a library for processing streaming data
  • Spark SQL – an API for handling structured data
  • MLLib – a Machine Learning library
  • GraphX – a Graph library

Cluster Manager

In order to operate in production mode, Spark needs a cluster manager that manages data distribution, task scheduling and fault tolerance among the various nodes.
Three types are supported – Apache Mesos, Hadoop YARN and Spark Standalone.

Spark Fundamentals

Spark, as previously defined, is a framework for fast, general-purpose cluster computing. Its fundamental abstraction is the RDD – the resilient distributed dataset – which is inherently parallelizable among the nodes and cores of a computing cluster. An RDD is normally held in memory, distributed across the cluster.

When data is read into Spark, it is read into an RDD, and once there it can
be operated on. There are two distinct types of operations on RDDs:

1. Transformations

Transformations are used to convert data in the RDD to another form. The result of a transformation is another RDD. Examples of transformations are:

  • map() – takes a function as argument and applies the function to each element of the RDD
  • flatMap() – takes a function as argument and applies the function to each element while “flattening” the results into a single-level collection.
  • filter() – takes a predicate function and returns an RDD containing the elements for which the predicate is true, e.g. the lines of a file which contain the string “Obama”.
  • reduceByKey() – given a Pair RDD, i.e. one of key-value pairs, merges the values for each key using the supplied function, returning another Pair RDD.

2. Actions

Actions are operations on an RDD that produce some output that is not an RDD, e.g. a local collection, a single value, or output to the screen. Examples of action operations are:

  • collect() – triggers the pending transformations on the RDD and returns the
    resulting elements to the driver as a collection.
  • count() – returns a count of the number of elements in an RDD
  • reduce() – takes a function and repeatedly applies it to the elements of the
    RDD to produce a single output value
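As a plain-Python analogue of reduce() (this uses functools.reduce over an ordinary list, not Spark, but the repeated pairwise combination is the same idea):

```python
from functools import reduce

values = [1, 2, 3, 4, 5]
# repeatedly combine an accumulator with the next element: ((((1+2)+3)+4)+5)
total = reduce(lambda acc, x: acc + x, values)   # 15
```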

RDDs and Lazy Evaluation

A fundamental idea in Spark’s implementation is lazy evaluation, which applies to all Spark transformation operations.
An RDD is fundamentally a data abstraction, so when we call, say:

scala> val rdd = sc.parallelize(Seq(1,3,4,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val mappedRDD = => x*x)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23

what we get as mappedRDD is just an expression that hasn’t been evaluated. The expression is essentially a record of the sequence of operations that need to be evaluated, i.e. parallelize –> map.
This expression amounts to what is known as a lineage graph and can be seen as follows:

scala> mappedRDD.toDebugString
res4: String = 
(4) MapPartitionsRDD[1] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []