After viewing this YouTube video and following the article How to Hadoop, I was able to successfully install Hortonworks Hadoop on a 4-node cluster on Amazon EC2. However, I needed to make the following additions to get it working properly:
1. Make sure that you install postgresql-8.4, not postgresql-9.1.
If you ran yum install postgresql, it may have installed postgresql-9.1; if that is the case, you need to erase it first:
yum erase postgresql
Download the 8.4 repository package, install it, and then install the 8.4 packages:
curl -O http://yum.postgresql.org/8.4/redhat/rhel-5-x86_64/pgdg-centos-8.4-3.noarch.rpm
rpm -ivh pgdg-centos-8.4-3.noarch.rpm
yum install postgresql84
yum install postgresql84-server
2. Edit the file /usr/sbin/ambari-server.py, changing the line
PG_HBA_DIR = "/var/lib/pgsql/data/"
to
PG_HBA_DIR = "/var/lib/pgsql/8.4/data/"
3. Make sure that the bindir for postgresql is added to the PATH.
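A minimal sketch, assuming the default install location used by the pgdg CentOS packages (verify yours with pg_config --bindir):

```shell
# Add the PostgreSQL 8.4 bindir to PATH
# (assumed default location for the pgdg CentOS packages; adjust if yours differs)
export PATH=$PATH:/usr/pgsql-8.4/bin
```

To make this permanent, the same line can go into /etc/profile.d/ or the relevant user's .bash_profile.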
4. Make sure that the "ambari-server" database user is created. Otherwise, you will get the following error:
internal exception: org.postgresql.util.psqlexception: fatal: ident authentication failed for user "ambari-server"
Also, edit the pg_hba.conf file (in the PG_HBA_DIR above) and add the following line:
host all all 127.0.0.1/32 md5
Solution referenced from: http://stackoverflow.com/questions/4562471/connecting-to-local-instance-of-postgresql-with-jdbc
The Data Problem
The advent of the Web 2.0 age has enabled the collection of massive amounts of data from web logs, GPS tracking data, sensor data, Twitter feeds, mobile phone logs, and so on.
The nature of the data problem we face is not the collection of data, but how to process such huge amounts of it (in petabytes, often on a daily basis) so as to obtain meaningful intelligence for decision-making. This processing problem has coincided with the gradual erosion of Moore's law: the limits of single-processor performance are being reached, while at the same time the cost of hardware keeps falling.
MapReduce is a programming paradigm used to process large datasets by parallelizing and distributing the work among independent processes or tasks. It is derived from a functional programming model. MapReduce was proposed as an algorithm to help address the data processing problem by making use of the latest trends in hardware computing: cheap hardware and distributed computing.
The basic idea is to have cheap commodity machines arranged in clusters running independent tasks among which large data sets could be distributed and processed.
The advantages of MapReduce are as follows:
- Clean abstraction for developers
- Fault tolerance and recovery
- Monitoring and status updates
- Automatic parallelization and distribution
- Input/output scheduling
MapReduce consists of two main functions: map and reduce.
Map: a list of key-value pairs is provided to the map function, which applies a function f to each element of the list and returns a list of intermediate key-value pairs.
Reduce: receives as input the list of intermediate key-value pairs produced by map and combines them to produce the final output.
All values corresponding to the same key will be processed at the same reducer node.
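As a rough analogy (not Hadoop itself), the classic word-count flow can be mimicked with a Unix pipeline, where sort plays the part of the shuffle that routes all values for a key to the same reducer. The sample input below is made up:

```shell
# hypothetical sample input
printf 'the cat sat on the mat\nthe cat ran\n' > input.txt

tr -s ' ' '\n' < input.txt |  # "map": emit one (word) record per line
sort |                        # "shuffle": group identical keys together
uniq -c                       # "reduce": combine each key's records into a count
```

The output is each distinct word alongside its count, just as the reducer in word count sums the 1s emitted by the mappers for each word.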
Applications of MapReduce
- Inverted Index Construction
- Document Clustering
- Web link-graph traversal
- Statistical Machine Translation
- Machine Learning
Hadoop provides an open-source implementation of MapReduce.
It is a highly scalable distributed computing framework that enables distributed processing of huge data sets across a cluster of machines via MapReduce.
It provides failure detection and failover at the application layer, as well as high availability.
Hadoop Environment Setup
The key features of a Hadoop setup are as follows:
- NameNode
- Provides data storage services for the shared file system
- Manages metadata of the cluster and the DataNodes
- Secondary NameNode
- Backup for NameNode
- JobTracker
- Manages jobs & resources in the cluster
- TaskTracker
- Manages tasks
- Does cluster balancing
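As a sketch of how these roles are wired together in a Hadoop 1.x cluster, two config files tell every node where the NameNode and JobTracker live. The hostnames and ports below are placeholders, not values from this install:

```xml
<!-- core-site.xml: points clients and DataNodes at the NameNode.
     namenode.example.com is a placeholder hostname. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>

<!-- mapred-site.xml: points TaskTrackers at the JobTracker.
     jobtracker.example.com and the port are placeholders. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</configuration>
```

With Ambari, these values are generated for you during cluster setup, but knowing which daemon each property targets helps when debugging a node that will not join the cluster.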