Step by Step Guide to Install Hadoop and Run the Word Count Program
Software required:
- Fedora
- Java 1.6.x or above
- Hadoop 0.20.2 (hadoop-0.20.2.tar.gz)
Steps for enabling root login via the GUI in Fedora 13 (on Ubuntu, just run 'sudo passwd' to set a root password):
Step 1. Open a terminal and switch to the root user:
[computer@localhost ~]$ su
Password:
Step 2. After entering the password, the prompt changes to the root user:
[root@localhost ~]#
Step 3. Now edit the following files:
[root@localhost ~]# vi /etc/pam.d/gdm
- Search for the line 'auth required pam_succeed_if.so user != root quiet'
- Either delete the line or comment it out by prefixing the character '#', as below:
  #auth required pam_succeed_if.so user != root quiet
- Now save the updated file with Esc -> :wq
- Following the same procedure, update the file /etc/pam.d/gdm-password:
[root@localhost ~]# vi /etc/pam.d/gdm-password
You should now be able to log in as 'root' via the GUI.
Steps for installing Hadoop:
Step 1. Check your Java version:
[root@localhost ~]# java -version
Step 2. Configure SSH access so that the root user can reach localhost without a password:
[root@localhost ~]# ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
2a:3e:61:08:bc:e5:9c:d7:f2:de:21:4a:6c:16:3f:fe root@localhost.localdomain
[root@localhost ~]# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
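You can optionally verify that the passwordless login works before moving on (not part of the original write-up; the first connection asks you to confirm the host key, and 'exit' returns to the previous shell):
[root@localhost ~]# ssh localhost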
Step 3. Place the Hadoop archive file hadoop-0.20.2.tar.gz in /usr/local and then type the following commands:
[root@localhost ~]# cd /usr/local
[root@localhost local]# tar xvfz hadoop-0.20.2.tar.gz
[root@localhost local]# mv hadoop-0.20.2 hadoop
A quick note on the commands above:
- cd: change directory
- tar: archiving utility (similar to WinZip)
- x: extract
- v: verbose (print file names as they are extracted)
- f: read the archive from the given file
- z: filter the archive through gzip
Step 4. Check the JAVA_HOME path:
[root@localhost local]# echo $JAVA_HOME
If this prints nothing, JAVA_HOME is not set; set it for the current shell as shown below, and make the change permanent in Step 5.
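A minimal way to set it for the current shell only (assuming the JDK lives at /usr/lib/jvm/java-1.6.0, the path used in Step 5; adjust it to match your system):
[root@localhost local]# export JAVA_HOME=/usr/lib/jvm/java-1.6.0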
Step 5. Also edit the following two files (adjust the JDK path if yours differs):
[root@localhost ~]# vi /etc/environment
Add the following line to it:
JAVA_HOME=/usr/lib/jvm/java-1.6.0
[root@localhost ~]# vi /etc/bashrc
Add the following lines to it:
JAVA_HOME=/usr/lib/jvm/java-1.6.0
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
Step 6. For Hadoop to work, we need to set configuration details in three main .xml files:
i) core-site.xml
ii) hdfs-site.xml
iii) mapred-site.xml
These files are located in the /usr/local/hadoop/conf folder. Open each of them in gedit and paste in the corresponding configuration given below.
core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-datastore</value>
<description>A base for other
temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the
default file system. A URI whose scheme and authority determine the FileSystem
implementation. The uri's scheme determines the config property
(fs.SCHEME.impl) naming the FileSystem implementation class. The uri's
authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port
that the MapReduce job tracker runs at. If "local", then jobs are run
in-process as a single map and reduce task.</description>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>The actual number
of replications can be specified when the file is created. The default is used
if replication is not specified in create time.</description>
</property>
</configuration>
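One extra hint, not covered in the original steps: if the daemons complain about JAVA_HOME when you start them in Step 10, you may also need to set it in conf/hadoop-env.sh, for example:
export JAVA_HOME=/usr/lib/jvm/java-1.6.0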
Step 7. Now create the base directory for Hadoop's temporary data (the hadoop.tmp.dir configured above) using the following commands:
[root@localhost ~]# mkdir /usr/local/hadoop-datastore
[root@localhost ~]# chown -hR root /usr/local/hadoop-datastore
[root@localhost ~]# chmod 750 /usr/local/hadoop-datastore
Some information about the above commands (useful during practical exams, as the external examiner is likely to ask questions about the commands and their usage):
- chown: change the file's owner (and group); here ownership is given to root
- chmod: change file access permissions
- -hR: apply the change recursively, affecting symbolic links themselves rather than the files they point to
- 750 is an octal permission mode built from r=4, w=2, x=1:
  7 = full access (rwx) for the owner
  5 = read and execute (r-x) for the group
  0 = no rights for others
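As a quick sanity check (optional, not part of the original procedure), you can confirm the ownership and permissions with:
[root@localhost ~]# ls -ld /usr/local/hadoop-datastore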
Step 8. Now format the filesystem (this simply initializes the directory specified by the dfs.name.dir variable):
[root@localhost local]# cd /usr/local/hadoop
[root@localhost hadoop]# bin/hadoop namenode -format
13/03/26 16:23:57 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
13/03/26 16:23:57 INFO namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
13/03/26 16:23:57 INFO namenode.FSNamesystem: supergroup=supergroup
13/03/26 16:23:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/03/26 16:23:57 INFO common.Storage: Image file of size 94 saved in 0 seconds.
13/03/26 16:23:58 INFO common.Storage: Storage directory /usr/local/hadoop-datastore/dfs/name has been successfully formatted.
13/03/26 16:23:58 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/
Step 9. Now create a sample input file named input.txt, place it in the /tmp/input/ folder, and fill it with some text, for example:
I live in Loni
I study in PREC
PREC is in Loni
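One way to create the folder and file (the file name input.txt is just an example; any text editor will do):
[root@localhost ~]# mkdir -p /tmp/input
[root@localhost ~]# vi /tmp/input/input.txt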
Step 10. Now start the Hadoop services so that all the daemons of the pseudo-distributed node become active, using the following command:
[root@localhost hadoop]# bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-root-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-root-tasktracker-localhost.localdomain.out
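To confirm that all five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) are actually running, you can use the jps tool that ships with the JDK (an optional check, not part of the original steps):
[root@localhost hadoop]# jps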
Step 11. Now transfer the input from the local filesystem to HDFS, from where the MapReduce job can fetch it:
[root@localhost hadoop]# bin/hadoop dfs -copyFromLocal /tmp/input/ /user/input/
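To double-check that the input landed in HDFS (optional):
[root@localhost hadoop]# bin/hadoop dfs -ls /user/input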
Step 12. Run the MapReduce job using the word-count example program provided in the Hadoop examples jar that came with the archive we extracted under /usr/local:
[root@localhost hadoop]# bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/input /user/output
13/03/26 16:34:57 INFO input.FileInputFormat: Total input paths to process : 1
13/03/26 16:34:57 INFO mapred.JobClient: Running job: job_201303261624_0001
13/03/26 16:34:58 INFO mapred.JobClient:  map 0% reduce 0%
13/03/26 16:35:07 INFO mapred.JobClient:  map 100% reduce 0%
13/03/26 16:35:19 INFO mapred.JobClient:  map 100% reduce 100%
13/03/26 16:35:21 INFO mapred.JobClient: Job complete: job_201303261624_0001
13/03/26 16:35:21 INFO mapred.JobClient: Counters: 17
13/03/26 16:35:21 INFO mapred.JobClient:   Job Counters
13/03/26 16:35:21 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:     Launched map tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:     Data-local map tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:   FileSystemCounters
13/03/26 16:35:21 INFO mapred.JobClient:     FILE_BYTES_READ=92
13/03/26 16:35:21 INFO mapred.JobClient:     HDFS_BYTES_READ=53
13/03/26 16:35:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=216
13/03/26 16:35:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=54
13/03/26 16:35:21 INFO mapred.JobClient:   Map-Reduce Framework
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce input groups=8
13/03/26 16:35:21 INFO mapred.JobClient:     Combine output records=8
13/03/26 16:35:21 INFO mapred.JobClient:     Map input records=3
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce output records=8
13/03/26 16:35:21 INFO mapred.JobClient:     Spilled Records=16
13/03/26 16:35:21 INFO mapred.JobClient:     Map output bytes=100
13/03/26 16:35:21 INFO mapred.JobClient:     Combine input records=12
13/03/26 16:35:21 INFO mapred.JobClient:     Map output records=12
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce input records=7
Step 13. After the MapReduce job is over, we check the list of files present in /user/output and then display the output file using the following commands:
[root@localhost hadoop]# bin/hadoop dfs -ls /user/output
Found 2 items
drwxr-xr-x   - root supergroup          0 2013-03-26 16:34 /user/output/_logs
-rw-r--r--   1 root supergroup         54 2013-03-26 16:35 /user/output/part-r-00000
Note: the part-r-00000 file is the output file generated by the MapReduce program. This file lives in HDFS, so we fetch it to our local filesystem in the /tmp/output directory using the following commands:
[root@localhost hadoop]# mkdir /tmp/output
[root@localhost hadoop]# bin/hadoop dfs -getmerge /user/output /tmp/output
[root@localhost hadoop]# cat /tmp/output/output
I 2
Loni 2
PREC 2
in 3
is 1
live 1
study 1
[root@localhost hadoop]#
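When you are finished, you can shut down all the Hadoop daemons (a good habit, though not part of the original write-up):
[root@localhost hadoop]# bin/stop-all.sh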
That's it! Hope it helped you. Keep visiting this blog for regular updates and useful study material.
Comments:
Q: Is it possible to install it on Windows?
A: Yes, it is possible to install Hadoop on Windows by using Cygwin (a compatibility layer that lets native Linux tools run on Windows).