Friday, 4 April 2014

Step by Step Guide to Install Hadoop and Run the Word Count Program

Software required:

  •  Fedora
  •  Java 1.6.x or above


Steps for enabling root login via the GUI in Fedora 13:
(for Ubuntu, just type sudo passwd)

Step 1. Open a terminal and switch to root using:
[computer@localhost ~]$  su
Password:
Step 2. After entering the password, the prompt changes to the root user:
[root@localhost ~]#
Step 3. Now edit the following files:
[root@localhost ~]#  vi /etc/pam.d/gdm
•       Search for the line 'auth required pam_succeed_if.so user != root quiet'
•       Either delete the line or comment it out by prefixing it with '#', as below
                     #auth required pam_succeed_if.so user != root quiet
•       Now save the updated file with        Esc -> :wq

•       Following the same procedure, update the file /etc/pam.d/gdm-password. You should now be able to log in as 'root' via the GUI.

[root@localhost ~]#  vi /etc/pam.d/gdm-password

Steps for installing Hadoop:

Download the Hadoop archive file hadoop-0.20.2.tar.gz (from the Apache Hadoop release archive).

Step 1. Check your Java version:
      [root@localhost ~]#  java -version
Step 2. Configure SSH access so that the root user can reach localhost without a password:
      [root@localhost ~]#  ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
2a:3e:61:08:bc:e5:9c:d7:f2:de:21:4a:6c:16:3f:fe root@localhost.localdomain

       [root@localhost ~]# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
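
To confirm that the passwordless setup works, you can try an SSH login to localhost (this check is a suggestion and not part of the original steps; the first connection may ask you to accept the host key):

       [root@localhost ~]# ssh localhost

If you get a shell without being asked for a password, the keys are set up correctly; type exit to come back.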

Step 3. Place the Hadoop archive file hadoop-0.20.2.tar.gz in /usr/local and then run the following commands:

[root@localhost ~]# cd /usr/local
[root@localhost local]# tar xvfz hadoop-0.20.2.tar.gz
[root@localhost local]# mv hadoop-0.20.2 hadoop


cd = change directory
tar = archiving utility (similar to WinZip)
x = extract
v = verbose (print file names as they are extracted)
f = file (operate on the given archive file)
z = filter the archive through gzip

Step 4. Check for the JAVA_HOME path

[root@localhost local]#  echo $JAVA_HOME

If there is no output, JAVA_HOME is not set; locate your JDK and set JAVA_HOME as described in the next step.
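
One way to locate the installed JDK (assuming the java binary is on your PATH; the exact path on your machine may differ from the examples used below) is to resolve the symlink behind the java command:

[root@localhost local]# readlink -f $(which java)

The JAVA_HOME value used in the next step should point to the JDK directory that contains the bin/java binary reported by this command.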


Step 5. Also edit the following two files:

[root@localhost ~]# vi /etc/environment

add the following line to it :

JAVA_HOME=/usr/lib/jvm/java-1.6.0

[root@localhost ~]# vi ~/.bashrc

add the following lines to it :

JAVA_HOME=/usr/lib/jvm/java-1.6.0
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
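
Hadoop also reads JAVA_HOME from its own environment file, conf/hadoop-env.sh, which in Hadoop 0.20.x ships with a commented-out export line. Pointing it at your JDK (the path below is the same example value used above and may differ on your machine) helps avoid "JAVA_HOME is not set" errors when the daemons start:

[root@localhost ~]# vi /usr/local/hadoop/conf/hadoop-env.sh

uncomment/add the line:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0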

Step 6. For Hadoop to work, we need to set configuration details in three main .xml files:

i) core-site.xml
ii) hdfs-site.xml
iii) mapred-site.xml

These files are located in the /usr/local/hadoop/conf folder. Open each of them in gedit and replace its contents with the corresponding configuration given below.

core-site.xml:

<configuration>
     <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/local/hadoop-datastore</value>
          <description>A base for other temporary directories.</description>
     </property>
     <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:54310</value>
          <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
     </property>
</configuration>


mapred-site.xml:

<configuration>
     <property>
          <name>mapred.job.tracker</name>
          <value>localhost:54311</value>
          <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
     </property>
</configuration>
 

hdfs-site.xml:

<configuration>
     <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
     </property>
</configuration>


Step 7. Now create the Hadoop temporary directory base using the following commands:

[root@localhost ~]# mkdir /usr/local/hadoop-datastore
[root@localhost ~]# chown  -hR root /usr/local/hadoop-datastore
[root@localhost ~]# chmod 750 /usr/local/hadoop-datastore

Some info about the above commands (useful during practical exams, as the external examiner is likely to ask questions about commands and their usage):
chown: change the owner of a file; here ownership is given to root
chmod: change file access permissions
-R: recursive; -h: act on symbolic links themselves rather than on the files they point to

750 is octal notation:
r=4, w=2, x=1
7 - full access for the owner (rwx)
5 - read and execute for the group (r-x)
0 - no rights for others
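
As a quick check (not part of the original steps), list the directory to see the resulting owner and permission bits; with mode 750 the listing should begin with drwxr-x---:

[root@localhost ~]# ls -ld /usr/local/hadoop-datastore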

Step 8. Now format the HDFS filesystem (this simply initializes the directory specified by the dfs.name.dir variable):

[root@localhost local]# cd /usr/local/hadoop
[root@localhost  hadoop]#  bin/hadoop namenode -format

13/03/26 16:23:57 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
13/03/26 16:23:57 INFO namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
13/03/26 16:23:57 INFO namenode.FSNamesystem: supergroup=supergroup
13/03/26 16:23:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/03/26 16:23:57 INFO common.Storage: Image file of size 94 saved in 0 seconds.
13/03/26 16:23:58 INFO common.Storage: Storage directory /usr/local/hadoop-datastore/dfs/name has been successfully formatted.
13/03/26 16:23:58 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

Step 9. Now create a sample input file named input.txt and place it in the /tmp/input/ folder with your own text, for example (the commands to create it follow the sample text):

I live in Loni
I study in PREC
PREC  is in Loni
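
A minimal way to create this file from the terminal (assuming the /tmp/input directory does not already exist) is:

[root@localhost hadoop]# mkdir -p /tmp/input
[root@localhost hadoop]# cat > /tmp/input/input.txt
I live in Loni
I study in PREC
PREC is in Loni

then press Ctrl+D to finish writing the file.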

Step 10. Now start the Hadoop services so that all the daemons of the pseudo-distributed cluster become active, using the following command (a quick way to verify they are running is shown after the startup messages):

         [root@localhost hadoop]# bin/start-all.sh

starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-root-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-root-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-root-tasktracker-localhost.localdomain.out
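
You can verify that everything came up with the jps tool that ships with the JDK (the process IDs will differ on your machine); it should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, plus Jps itself:

[root@localhost hadoop]# jps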


Step 11. Next, we need to transfer the input from the local filesystem to HDFS, from where the MapReduce job can fetch it. We do it as follows:

[root@localhost hadoop]#  bin/hadoop dfs -copyFromLocal /tmp/input/ /user/input/
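
Optionally, confirm that the file reached HDFS before running the job:

[root@localhost hadoop]#  bin/hadoop dfs -ls /user/input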

Step 12. Run the MapReduce job using the WordCount example program provided in the Hadoop examples jar (hadoop-0.20.2-examples.jar, which came with the archive we extracted under /usr/local). A tip for watching job progress follows the job output:

[root@localhost hadoop]#  bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/input  /user/output

13/03/26 16:34:57 INFO input.FileInputFormat: Total input paths to process : 1
13/03/26 16:34:57 INFO mapred.JobClient: Running job: job_201303261624_0001
13/03/26 16:34:58 INFO mapred.JobClient:  map 0% reduce 0%
13/03/26 16:35:07 INFO mapred.JobClient:  map 100% reduce 0%
13/03/26 16:35:19 INFO mapred.JobClient:  map 100% reduce 100%
13/03/26 16:35:21 INFO mapred.JobClient: Job complete: job_201303261624_0001
13/03/26 16:35:21 INFO mapred.JobClient: Counters: 17
13/03/26 16:35:21 INFO mapred.JobClient:   Job Counters
13/03/26 16:35:21 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:     Launched map tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:     Data-local map tasks=1
13/03/26 16:35:21 INFO mapred.JobClient:   FileSystemCounters
13/03/26 16:35:21 INFO mapred.JobClient:     FILE_BYTES_READ=92
13/03/26 16:35:21 INFO mapred.JobClient:     HDFS_BYTES_READ=53
13/03/26 16:35:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=216
13/03/26 16:35:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=54
13/03/26 16:35:21 INFO mapred.JobClient:   Map-Reduce Framework
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce input groups=8
13/03/26 16:35:21 INFO mapred.JobClient:     Combine output records=8
13/03/26 16:35:21 INFO mapred.JobClient:     Map input records=3
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce output records=8
13/03/26 16:35:21 INFO mapred.JobClient:     Spilled Records=16
13/03/26 16:35:21 INFO mapred.JobClient:     Map output bytes=100
13/03/26 16:35:21 INFO mapred.JobClient:     Combine input records=12
13/03/26 16:35:21 INFO mapred.JobClient:     Map output records=12
13/03/26 16:35:21 INFO mapred.JobClient:     Reduce input records=7
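
While the job is running you can also follow its progress in the JobTracker web interface, which Hadoop 0.20 serves on port 50030 by default; open the following address in a browser on the same machine:

http://localhost:50030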

Step 13. After the MapReduce job is over, we list the files present in /user/output and then display the output file using the following commands:

[root@localhost hadoop]# bin/hadoop dfs -ls /user/output
Found 2 items
drwxr-xr-x   - root supergroup          0 2013-03-26 16:34 /user/output/_logs
-rw-r--r--   1 root supergroup         54 2013-03-26 16:35 /user/output/part-r-00000

Note: the part-r-00000 file is the output file generated by the MapReduce program. But this file is in HDFS, so we copy it to our local filesystem in the /tmp/output directory using the following commands:

[root@localhost hadoop]# mkdir /tmp/output
[root@localhost hadoop]# bin/hadoop dfs -getmerge /user/output /tmp/output
[root@localhost hadoop]# cat /tmp/output/output

I        2
Loni     2
PREC     2
in       3
is       1
live     1
study    1
[root@localhost hadoop]#
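
When you are finished experimenting, the whole pseudo-cluster can be shut down with the companion script to start-all.sh:

[root@localhost hadoop]# bin/stop-all.sh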




That's it! Hope it helped you. Keep visiting this blog for regular updates and useful study material.

Like our Facebook page 

You can follow us on Twitter to keep yourself updated on all the latest from the tech world.

For any doubts, queries, etc., just comment below.

2 comments:

  1. Is it possible to install it on Windows?

  2. Yes, it is possible to install Hadoop on Windows by using Cygwin (a compatibility layer for running Linux-style tools on Windows).
