MapReduce and Hadoop – an Introduction

The Data Problem

The advent of the Web 2.0 age has enabled the collection of massive amounts of data from web logs, GPS tracking, sensors, Twitter feeds, mobile phone logs, and other sources.

The data problem we face is not collecting the data but processing such huge amounts of it (petabytes), often on a daily basis, to obtain meaningful intelligence on which to base decisions. This processing problem has coincided with the gradual erosion of Moore’s law, as the limits of single-machine processing power are being reached, while the cost of hardware has kept falling.

Enter MapReduce

MapReduce is a programming paradigm used to process large datasets by parallelizing and distributing the work among independent processes or tasks. It is derived from the map and reduce operations of functional programming. MapReduce was proposed as a programming model to help address the data processing problem by making use of two hardware trends: cheap hardware and distributed computing.
The basic idea is to arrange cheap commodity machines in clusters that run independent tasks, among which large data sets can be distributed and processed.
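
As a rough illustration of this idea, the sketch below splits a dataset into chunks, hands each chunk to an independent worker, and then combines the partial results. It uses Python threads on a single machine as a stand-in for cluster nodes. The names split_into_chunks, process_chunk and parallel_sum are hypothetical, and a real cluster would also have to handle data distribution, scheduling and failures.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into at most num_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """An independent task: here, sum the numbers in one chunk."""
    return sum(chunk)


def parallel_sum(data, num_workers=4):
    """Process chunks independently, then combine the partial results."""
    chunks = split_into_chunks(data, num_workers)
    # Each chunk is handled on its own; the workers share no state.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)
```

Because no task depends on another, adding more workers (or, in a cluster, more machines) does not change the result, only how the work is spread out.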

MapReduce provides the following advantages:

  • Clean abstraction for developers
  • Fault tolerance and recovery
  • Monitoring and status updates
  • Automatic parallelization and distribution
  • Input/output scheduling

Basic Idea

MapReduce consists of two main functions:
Map([key,value])
A list of key-value pairs is provided to the map function, which applies a function f to each element of the list and returns a list of intermediate key-value pairs.
Reduce([key,value])
Reduce receives as input the intermediate key-value pairs produced by map, grouped by key, and combines the values for each key to produce the final output.
All values corresponding to the same key are processed at the same reducer node.
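
The two functions can be sketched with the classic word-count example. This is a minimal single-process simulation, not Hadoop code: map_fn, shuffle, reduce_fn and run_mapreduce are hypothetical names chosen for illustration, and the shuffle step stands in for the framework's job of routing all values for a key to the same reducer.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in value.split()]


def shuffle(intermediate_pairs):
    """Group intermediate values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map over every record, group by key, then reduce each group."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(k, v) for k, v in grouped.items())


# Input records are (line number, line text) pairs.
records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
counts = run_mapreduce(records)
```

Here counts maps each word to the number of times it occurs across all input lines. In a real cluster the map calls would run on different machines and the grouped keys would be partitioned among several reducers.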

Applications of MapReduce

  • Inverted Index Construction
  • Document Clustering
  • Web Link-Graph Traversal
  • Statistical Machine Translation
  • Machine Learning
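
To make the first application concrete, the sketch below builds an inverted index (a mapping from each term to the documents that contain it) in the map/reduce style. It is a single-process illustration under simplifying assumptions: terms are split on whitespace, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, doc_id) for each distinct term in the document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_postings(term, doc_ids):
    """Reduce: collect the sorted list of documents that contain the term."""
    return term, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Apply map to every document, group by term, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_doc(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(t, ids) for t, ids in grouped.items())


docs = {
    "d1": "hadoop runs mapreduce jobs",
    "d2": "mapreduce processes data",
    "d3": "hadoop stores data",
}
index = build_inverted_index(docs)
```

Looking up a term in index returns the identifiers of the documents containing it, which is the core lookup behind keyword search.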

Hadoop

Hadoop provides an open-source implementation of MapReduce.
It is a highly scalable distributed computing framework that enables distributed processing of huge data sets across a cluster of machines via MapReduce.
It provides failure detection and failover at the application layer, as well as high availability.

Hadoop Environment Setup

The key components of a Hadoop setup are as follows:

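DataNode
Provides data storage services for the shared file system
NameNode
Manages the file system metadata and the DataNodes
Secondary NameNode
Periodically checkpoints the NameNode’s metadata; it is a checkpointing helper rather than a hot standby for the NameNode
JobTracker
Manages jobs and resources in the cluster
TaskTracker
Runs and manages the tasks assigned to its node
Balancer
Rebalances data across the DataNodes in the cluster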