Friday, 4 April 2014

Step by Step Guide to Install Hadoop and Run the Word Count Program

Software required:

  •  Fedora
  •  Java 1.6.x or above


Steps for enabling root login via the GUI in Fedora 13:
(for Ubuntu, just type sudo passwd)

Step 1. Open a terminal and switch to root using:
[computer@localhost ~]$  su
Password:
Step 2. After entering the password, the prompt changes to the root user:
[root@localhost ~]#
Step 3. Now edit the following files:
[root@localhost ~]#  vi /etc/pam.d/gdm
•       Search for the line 'auth required pam_succeed_if.so user != root quiet'
•       Either delete the line or comment it out by prefixing it with '#', as below
                     #auth required pam_succeed_if.so user != root quiet
•       Now save the updated file with        Esc -> :wq

•       Following the same procedure, update the file /etc/pam.d/gdm-password. You should now be able to log in as 'root' via the GUI.

[root@localhost ~]#  vi /etc/pam.d/gdm-password

Steps for installing Hadoop:

Download the Hadoop archive file hadoop-0.20.2.tar.gz (from the Apache Hadoop release archive).

Step 1. Check your Java version:
      [root@localhost ~]#  java -version
Step 2. Configure SSH access so that the root user can reach localhost without a password:
      [root@localhost ~]#  ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
2a:3e:61:08:bc:e5:9c:d7:f2:de:21:4a:6c:16:3f:fe root@localhost.localdomain

       [root@localhost ~]# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
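
To confirm that the passwordless setup works, you can try an SSH login to localhost (this check is a suggestion and not part of the original steps; the first connection may ask you to accept the host key):

       [root@localhost ~]# ssh localhost

If you get a shell without being asked for a password, the keys are set up correctly; type exit to come back.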

Step 3. Place the Hadoop archive file hadoop-0.20.2.tar.gz in /usr/local and then run the following commands:

[root@localhost ~]# cd /usr/local
[root@localhost local]# tar xvfz hadoop-0.20.2.tar.gz
[root@localhost local]# mv hadoop-0.20.2 hadoop


cd = change directory
tar = archiving utility (similar to WinZip)
x = extract
v = verbose (print file names as they are extracted)
f = file (operate on the given archive file)
z = filter the archive through gzip

Step 4. Check for the JAVA_HOME path

[root@localhost local]#  echo $JAVA_HOME

If there is no output, JAVA_HOME is not set; locate your JDK and set JAVA_HOME as described in the next step.
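
One way to locate the installed JDK (assuming the java binary is on your PATH; the exact path on your machine may differ from the examples used below) is to resolve the symlink behind the java command:

[root@localhost local]# readlink -f $(which java)

The JAVA_HOME value used in the next step should point to the JDK directory that contains the bin/java binary reported by this command.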


Step 5. Also edit the following two files:

[root@localhost ~]# vi /etc/environment

add the following line to it :

JAVA_HOME=/usr/lib/jvm/java-1.6.0

[root@localhost ~]# vi ~/.bashrc

add the following lines to it :

JAVA_HOME=/usr/lib/jvm/java-1.6.0
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
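
Hadoop also reads JAVA_HOME from its own environment file, conf/hadoop-env.sh, which in Hadoop 0.20.x ships with a commented-out export line. Pointing it at your JDK (the path below is the same example value used above and may differ on your machine) helps avoid "JAVA_HOME is not set" errors when the daemons start:

[root@localhost ~]# vi /usr/local/hadoop/conf/hadoop-env.sh

uncomment/add the line:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0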

Step 6. For Hadoop to work, we need to set configuration details in three main .xml files:

i) core-site.xml
ii) hdfs-site.xml
iii) mapred-site.xml

These files are located in the /usr/local/hadoop/conf folder. Open each of them in gedit and replace its contents with the corresponding configuration given below.

core-site.xml:

<configuration>
     <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/local/hadoop-datastore</value>
          <description>A base for other temporary directories.</description>
     </property>
     <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:54310</value>
          <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
     </property>
</configuration>


mapred-site.xml:

<configuration>
     <property>
          <name>mapred.job.tracker</name>
          <value>localhost:54311</value>
          <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
     </property>
</configuration>
 

hdfs-site.xml:

<configuration>
     <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
     </property>
</configuration>


Step 7. Now create the Hadoop temporary directory base using the following commands:

[root@localhost ~]# mkdir /usr/local/hadoop-datastore
[root@localhost ~]# chown  -hR root /usr/local/hadoop-datastore
[root@localhost ~]# chmod 750 /usr/local/hadoop-datastore

Some info about the above commands (useful during practical exams, as the external examiner is likely to ask questions about commands and their usage):
chown: change the owner of a file; here ownership is given to root
chmod: change file access permissions
-R: recursive; -h: act on symbolic links themselves rather than on the files they point to

750 is octal notation:
r=4, w=2, x=1
7 - full access for the owner (rwx)
5 - read and execute for the group (r-x)
0 - no rights for others
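
As a quick check (not part of the original steps), list the directory to see the resulting owner and permission bits; with mode 750 the listing should begin with drwxr-x---:

[root@localhost ~]# ls -ld /usr/local/hadoop-datastore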

Step 8. Now format the HDFS filesystem (this simply initializes the directory specified by the dfs.name.dir variable):

[root@localhost local]# cd /usr/local/hadoop
[root@localhost  hadoop]#  bin/hadoop namenode -format

13/03/26 16:23:57 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
13/03/26 16:23:57 INFO namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
13/03/26 16:23:57 INFO namenode.FSNamesystem: supergroup=supergroup
13/03/26 16:23:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/03/26 16:23:57 INFO common.Storage: Image file of size 94 saved in 0 seconds.
13/03/26 16:23:58 INFO common.Storage: Storage directory /usr/local/hadoop-datastore/dfs/name has been successfully formatted.
13/03/26 16:23:58 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

Step 9. Now create a sample input file named input.txt and place it in the /tmp/input/ folder with your own text, for example (the commands to create it follow the sample text):

I live in Loni
I study in PREC
PREC  is in Loni
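
A minimal way to create this file from the terminal (assuming the /tmp/input directory does not already exist) is:

[root@localhost hadoop]# mkdir -p /tmp/input
[root@localhost hadoop]# cat > /tmp/input/input.txt
I live in Loni
I study in PREC
PREC is in Loni

then press Ctrl+D to finish writing the file.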

Step 10. Now start the Hadoop services so that all the daemons of the pseudo-distributed cluster become active, using the following command (a quick way to verify they are running is shown after the startup messages):

         [root@localhost hadoop]# bin/start-all.sh

starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-root-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-root-tasktracker-localhost.localdomain.out
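
You can verify that everything came up with the jps tool that ships with the JDK (the process IDs will differ on your machine); it should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, plus Jps itself:

[root@localhost hadoop]# jps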


Step 11. Next, we need to transfer the input from the local filesystem to HDFS, from where the MapReduce job can fetch it. We do it as follows:

[root@localhost hadoop]#  bin/hadoop dfs -copyFromLocal /tmp/input/ /user/input/
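
Optionally, confirm that the file reached HDFS before running the job:

[root@localhost hadoop]#  bin/hadoop dfs -ls /user/input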

Step 12. Run the MapReduce job using the WordCount example program provided in the Hadoop examples jar (hadoop-0.20.2-examples.jar, which came with the archive we extracted under /usr/local). A tip for watching job progress follows the job output:

[root@localhost hadoop]#  bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/input  /user/output

13/03/26 16:34:57 INFO input.FileInputFormat: Total input paths to process : 1
13/03/26 16:34:57 INFO mapred.JobClient: Running job: job_201303261624_0001
13/03/26 16:34:58 INFO mapred.JobClient:  map 0% reduce 0%
13/03/26 16:35:07 INFO mapred.JobClient:  map 100% reduce 0%
13/03/26 16:35:19 INFO mapred.JobClient:  map 100% reduce 100%
13/03/26 16:35:21 INFO mapred.JobClient: Job complete: job_201303261624_0001
13/03/26 16:35:21 INFO mapred.JobClient: Counters: 17
13/03/26 16:35:21 INFO mapred.JobClient:   Job Counters
13/03/26 16:35:21 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:     Launched map tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:     Data-local map tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:   FileSystemCounters
13/03/26 16:35:21 INFO mapred.JobClient:     FILE_BYTES_READ=92
13/03/26 16:35:21 INFO mapred.JobClient:     HDFS_BYTES_READ=53
13/03/26 16:35:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=216
13/03/26 16:35:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=54
13/03/26 16:35:21 INFO mapred.JobClient:   Map-Reduce Framework
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce input groups=8
13/03/26 16:35:21 INFO mapred.JobClient:     Combine output records=8
13/03/26 16:35:21 INFO mapred.JobClient:     Map input records=3
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce output records=8
13/03/26 16:35:21 INFO mapred.JobClient:     Spilled Records=16
13/03/26 16:35:21 INFO mapred.JobClient:     Map output bytes=100
13/03/26 16:35:21 INFO mapred.JobClient:     Combine input records=12
13/03/26 16:35:21 INFO mapred.JobClient:     Map output records=12
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce input records=7
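
While the job is running you can also follow its progress in the JobTracker web interface, which Hadoop 0.20 serves on port 50030 by default; open the following address in a browser on the same machine:

http://localhost:50030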

Step 13. After the MapReduce job is over, we list the files present in /user/output and then display the output file using the following commands:

[root@localhost hadoop]# bin/hadoop dfs -ls /user/output
Found 2 items
drwxr-xr-x   - root supergroup          0 2013-03-26 16:34 /user/output/_logs
-rw-r--r--   1 root supergroup         54 2013-03-26 16:35 /user/output/part-r-00000

Note: the part-r-00000 file is the output file generated by the MapReduce program. But this file is in HDFS, so we copy it to our local filesystem in the /tmp/output directory using the following commands:

[root@localhost hadoop]# mkdir /tmp/output
[root@localhost hadoop]# bin/hadoop dfs -getmerge /user/output /tmp/output
[root@localhost hadoop]# cat /tmp/output/output

I        2
Loni     2
PREC     2
in       3
is       1
live     1
study    1
[root@localhost hadoop]#
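
When you are finished experimenting, the whole pseudo-cluster can be shut down with the companion script to start-all.sh:

[root@localhost hadoop]# bin/stop-all.sh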




That's it! Hope it helped you. Keep visiting this blog for regular updates and useful study material.

Like our Facebook page 

You can follow us on Twitter to keep yourself updated on all the latest from the tech world.

For any doubts, queries, etc., just comment below.

2 comments:

  1. Is it possible to install it on Windows?

  2. Yes, it is possible to install Hadoop on Windows by using Cygwin (a compatibility layer for running Linux-style tools on Windows).
