Quick Introduction to Apache Spark

What is Spark?

Spark is a fast, general-purpose framework for cluster computing.
It is written in Scala but provides APIs in Scala, Java and Python.
The main data abstraction in Spark is the RDD, or Resilient Distributed Dataset.
With this abstraction, Spark enables data to be distributed and processed across
the many nodes of a computing cluster.
It provides both interactive and programmatic interfaces. It consists of the following components:

  • Spark Core – the foundational classes on which Spark is built. It provides the RDD.
  • Spark Streaming – a library for processing live streams of data
  • Spark SQL – an API for handling structured data
  • MLlib – a Machine Learning library
  • GraphX – a Graph library

Cluster Manager

To run in production mode, Spark needs a cluster manager that handles task scheduling, data distribution and fault tolerance across the nodes of the cluster.
Three cluster managers are supported – Apache Mesos, Hadoop YARN and Spark's own standalone manager.
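The cluster manager is selected through the master URL supplied when the SparkContext is created (or passed to spark-submit). A minimal sketch – the host names below are placeholders, not a real cluster:

import org.apache.spark.{SparkConf, SparkContext}

// "spark://master-host:7077" points at a (hypothetical) standalone master;
// "yarn" or "mesos://mesos-host:5050" would select the other managers,
// and "local[*]" runs Spark on the local machine for development and testing.
val conf = new SparkConf()
  .setAppName("quick-intro")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)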

Spark Fundamentals

Spark, as defined above, is a framework for fast and general-purpose cluster computing. Its fundamental abstraction is the RDD – the Resilient Distributed Dataset – whose elements are partitioned across the nodes and cores of a computing cluster so that they can be processed in parallel. Wherever possible, an RDD is held entirely in RAM.
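As a concrete illustration of getting data into an RDD from the interactive shell (the file path below is just a placeholder):

// sc is the SparkContext that the Spark shell provides automatically.
// The path is hypothetical – any text file reachable by the cluster will do.
val lines = sc.textFile("hdfs:///path/to/somefile.txt")
// Each element of the RDD "lines" is one line of the file.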

As shown above, when data is read into Spark it is read into an RDD, and once it is
in an RDD it can be operated on. There are two distinct types of operations on RDDs:

1. Transformations

Transformations are used to convert the data in an RDD into another form; the result of a transformation is always another RDD. Some common transformations are listed here, with a short sketch of their use after the list:

  • map()– takes a function as argument and applies the function to each item/element of the RDD
  • flatMap() – takes a function as argument and applies the function to each element while “flattening” the results into a single level collection.
  • filter() – takes a predicate (a function returning a boolean) and returns an RDD containing only the elements for which the predicate is true, e.g. the lines of a file which contain the string “Obama”
  • reduceByKey() – given a Pair RDD, i.e. one of key/value pairs, combines the values for each key using the supplied function and returns another Pair RDD. (countByKey(), which returns counts by key as a local map, is an action rather than a transformation.)
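A minimal sketch of these transformations from the Spark shell – the input data is made up purely for illustration:

// A small made-up RDD of strings, standing in for the lines of a text file
val lines = sc.parallelize(Seq("Obama spoke", "the senate voted", "Obama signed"))

val upper  = lines.map(l => l.toUpperCase)              // map: apply a function to each element
val words  = lines.flatMap(l => l.split(" "))           // flatMap: split each line and flatten into one RDD of words
val obama  = lines.filter(l => l.contains("Obama"))     // filter: keep only lines containing "Obama"
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // Pair RDD of (word, count)

// Nothing has been computed yet – these are all transformations (see lazy evaluation below).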

2. Actions

Actions are operations on an RDD which produce something other than an RDD – e.g. a number, a local collection, or output written to the screen or to storage. Some common actions are listed here, with a short sketch of their use after the list:

  • collect() – forces evaluation of any pending transformations and returns all the elements of the RDD to the driver as a local collection (an Array in Scala)
  • count() – returns the number of elements in the RDD
  • reduce() – takes a function and repeatedly applies it to the elements of the
    RDD to produce a single output value
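
A short sketch of these actions, continuing with the made-up RDDs from the transformations example above:

val obamaLines = obama.collect()   // Array("Obama spoke", "Obama signed") – the matching lines, back on the driver
val total      = lines.count()     // 3 – the number of elements in the RDD
val sum        = sc.parallelize(Seq(1, 3, 4, 5, 9)).reduce(_ + _)  // 22

// Unlike transformations, each of these calls triggers actual computation on the cluster.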

RDDs and Lazy Evaluation

A fundamental idea in Spark’s implementation is lazy evaluation, which applies to all Spark transformation operations.
An RDD is fundamentally a data abstraction, so when we make a call such as:

scala> val rdd = sc.parallelize(Seq(1,3,4,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val mappedRDD = rdd.map(x => x*x)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23

what we get back as mappedRDD is just an expression that has not yet been evaluated. The expression is essentially a record of the sequence of operations that needs to be evaluated, i.e. parallelize –> map.
This record is known as the lineage graph and can be inspected as follows:

scala> mappedRDD.toDebugString
res4: String = 
(4) MapPartitionsRDD[1] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
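
Only when an action is called is this lineage actually evaluated. Continuing the same shell session (the result number and formatting are illustrative):

scala> mappedRDD.collect()
res5: Array[Int] = Array(1, 9, 16, 25, 81)

The call to collect() forces the parallelize –> map chain to run and returns the squared values to the driver.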
