MapReduce and Hadoop – an Introduction

The Data Problem

The advent of the Web 2.0 age has enabled the collection of massive amounts of data from web logs, GPS tracking data, sensor data, Twitter feeds, mobile phone logs, and so on.

The nature of the data problem we face is not collecting this data, but processing such huge amounts of it (often petabytes, often on a daily basis) to obtain meaningful intelligence on which to base decisions. This processing problem has coincided with the gradual erosion of Moore's law, in which the limits of single-processor performance are being reached, while at the same time the cost of hardware keeps falling.

Enter MapReduce

MapReduce is a programming paradigm for processing large datasets by parallelizing and distributing the work among independent processes or tasks. It is derived from the functional programming model. MapReduce was proposed to address the data processing problem by exploiting the latest trends in hardware computing: cheap hardware and distributed computing.
The basic idea is to have cheap commodity machines arranged in clusters, running independent tasks among which large data sets can be distributed and processed.

MapReduce provides the following advantages:

  • Clean abstraction for developers
  • Fault tolerance and recovery
  • Monitoring and status updates
  • Automatic parallelization and distribution
  • Input/output scheduling
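The "automatic parallelization and distribution" point is the heart of it: because each map task is independent, a runtime can fan the work out across workers without any coordination logic from the developer. A minimal sketch of that idea using a plain Python process pool (my own illustration, not Hadoop code):

```python
from multiprocessing import Pool

def square(x):
    # A pure "map" task: no shared state, so it can run on any worker.
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:                 # 4 worker processes
        results = pool.map(square, range(10))
    print(results)                        # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

A MapReduce framework does the same fan-out across machines rather than local processes, and additionally handles the fault tolerance and scheduling items listed above.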

Basic Idea

It consists of two main functions, map and reduce.
A list of key-value pairs is provided to the map function, which applies a function f to each element of the list and returns a list of intermediate key-value pairs.
Reduce receives as input the list of intermediate key-value pairs produced by map and combines them to produce the final output.
All values corresponding to the same key are processed at the same reducer node.
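The two functions can be sketched with the canonical word-count example. The function names and the in-memory "shuffle" (grouping intermediate pairs by key, so that all values for a key reach the same reducer) are my own illustration:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Combine all values that share the same key.
    return (word, sum(counts))

def map_reduce(documents, map_fn, reduce_fn):
    # Shuffle step: group intermediate pairs by key, so that all
    # values for a given key are handled by the same reduce call.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = {"d1": "the cat sat", "d2": "the cat ran"}
print(map_reduce(docs, map_fn, reduce_fn))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In a real cluster the documents, the intermediate pairs, and the reduce calls are spread across many machines; the program structure stays exactly this simple, which is the "clean abstraction" advantage noted earlier.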

Applications of MapReduce

  • Inverted Index Construction
  • Document Clustering
  • Web link-graph traversal
  • Statistical Machine Translation
  • Machine Learning
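The first application above, inverted index construction, maps onto the paradigm especially cleanly: map emits (word, document-id) pairs, and reduce collects each word's posting list. A small sketch under those assumptions (the grouping step is done inline here):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    return [(word, doc_id) for word in set(text.split())]

def reduce_fn(word, doc_ids):
    # The posting list is the sorted set of documents containing the word.
    return (word, sorted(set(doc_ids)))

docs = {"d1": "hadoop stores data", "d2": "mapreduce processes data"}

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)

index = dict(reduce_fn(w, ids) for w, ids in groups.items())
print(index["data"])   # ['d1', 'd2']
```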


Hadoop

Hadoop provides an open-source implementation of MapReduce.
It is a highly scalable distributed computing framework that enables distributed processing of huge data sets across a cluster of machines via MapReduce.
It provides failure detection and failover at the application layer, as well as high availability.
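Hadoop's Streaming interface lets you write the map and reduce steps as plain scripts that read stdin and write tab-separated key/value lines, with Hadoop sorting the intermediate output by key between the two stages. The sketch below simulates that pipeline in-process (the simulation harness is mine; on a cluster the two functions would run as separate mapper and reducer scripts):

```python
import io

def mapper(stream):
    # Streaming-style mapper: one "word \t 1" line per input word.
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming-style reducer: input arrives sorted by key, so equal
    # keys are adjacent and can be summed with a running total.
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the pipeline: map -> sort (the shuffle) -> reduce.
raw = io.StringIO("big data\nbig cluster\n")
output = list(reducer(sorted(mapper(raw))))
print(output)   # ['big\t2', 'cluster\t1', 'data\t1']
```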

Hadoop Environment Setup

The key components of a Hadoop setup are as follows:

  • NameNode – manages metadata of the cluster and the DataNodes
  • DataNode – provides data storage services for the shared file system
  • Secondary NameNode – backup for the NameNode
  • JobTracker – manages jobs & resources in the cluster
  • TaskTracker – manages tasks
  • Balancer – does cluster balancing

Data Science, Big Data and MapReduce

To start one’s journey into the world of Data Science and Analytics, it behooves one to gain an understanding of the most touted terms in this field. The paper Data Science and Distributed Intelligence provides a good explanation of the often-used terms Big Data, Data Science and MapReduce. Here is my synopsis of these terms based on a reading of this paper:

Big Data refers to the proliferation of large amounts of data, both in huge databases and in the very high-rate streaming data that produces them. Examples are web server logs from user clicks, cell phone logs, Twitter feeds, Facebook comments, and so on. An interesting claim is that 90% of all Big Data was produced in just the last two years.

Data Science is a term that describes the process of gathering data, analyzing it, and extracting information from it to produce a “data product”. Such data may involve large unstructured data sets, so the term ‘Data Science’ is often used in conjunction with Big Data. Since the processing techniques for Big Data often involve the functional programming primitives map and reduce, this has given rise to the MapReduce algorithm, popularized by Google.

MapReduce is an algorithm that helps mitigate the problem of processing Big Data and large data sets by breaking down the processing into smaller, more manageable chunks.
Hadoop is the popular open-source implementation of MapReduce.