Hadoop is a technology that enables distributed processing of large data sets across clusters ranging from a single server to thousands of machines, while ensuring a high degree of fault tolerance.

Hadoop is a framework that consists of the following basic modules:
1) Hadoop Common – contains libraries and utilities needed by other Hadoop modules
2) Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
3) Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
4) Hadoop MapReduce – a programming model for large scale data processing.

In this tutorial we will install and configure Hadoop on Ubuntu (12.10/13.04/13.10). Follow the steps below:

Step 1: Update your machine.

root@hadoop1:~# apt-get update

Install the python-software-properties package, which provides the add-apt-repository command:

root@hadoop1:~# apt-get install python-software-properties

Add the Java repository:

root@hadoop1:~# add-apt-repository ppa:webupd8team/java
root@hadoop1:~# apt-get update && apt-get upgrade
root@hadoop1:~# apt-get install oracle-java6-installer

Check the installed java version:

root@hadoop1:~# java -version
java version "1.6.0_45" 
Java(TM) SE Runtime Environment (build 1.6.0_45-b06) 
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

If the update-java-alternatives command shows two installed versions of Java, run the following command to select the latest one:

root@hadoop1:~# update-alternatives --config java

There is only one alternative in link group java: /usr/lib/jvm/java-6-oracle/jre/bin/java. Nothing to configure here.
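
As an optional extra check, you can resolve the symlink chain to confirm which binary the java command actually points to; this should print the same path shown above:

root@hadoop1:~# readlink -f /usr/bin/java
/usr/lib/jvm/java-6-oracle/jre/bin/java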

Step 2: Add a hadoop group to the system.

root@hadoop1:~# addgroup hadoopgroup

Add a hadoop user to the group created earlier:

root@hadoop1:~# adduser --ingroup hadoopgroup hadoopuser
Adding user `hadoopuser' ... 
Adding new user `hadoopuser' (1001) with group `hadoopgroup' ... 
Creating home directory `/home/hadoopuser' ... 
Copying files from `/etc/skel' ... 
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully 
Changing the user information for hadoopuser 
Enter the new value, or press ENTER for the default 
Full Name []: HADOOP USER 
Room Number []: 
Work Phone []: 
Home Phone []: 
Other []: 
Is the information correct? [Y/n] Y

Step 3: Set up passwordless SSH authentication.

root@hadoop1:~# su - hadoopuser
hadoopuser@hadoop1:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair. 
Enter file in which to save the key (/home/hadoopuser/.ssh/id_rsa): 
Created directory '/home/hadoopuser/.ssh'. 
Your identification has been saved in /home/hadoopuser/.ssh/id_rsa. 
Your public key has been saved in /home/hadoopuser/.ssh/id_rsa.pub. 
The key fingerprint is: 
82:a0:cb:f4:fa:1f:ac:f5:29:54:34:e7:56:ee:b0:9f hadoopuser@hadoop1.example.com
The key's randomart image is: 
+--[ RSA 2048]----+ 
|                 | 
|       o . .     | 
|  .   . + o      | 
| . . . . + .     | 
|..  . o S +      | 
|o.. .. . . .     | 
|.. ..+    . .    | 
|  . o.o .  E     | 
| ..o...o         | 
+-----------------+ 
hadoopuser@hadoop1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hadoopuser@hadoop1:~$ ssh localhost

You will see a message like the one below the first time you connect:

The authenticity of host 'localhost (127.0.0.1)' can't be established. 
ECDSA key fingerprint is ee:be:18:ef:e6:3d:e3:8d:8a:17:ca:d1:a3:d6:d6:49. 
Are you sure you want to continue connecting (yes/no)? yes 
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts. 
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic x86_64)
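
Type exit to leave this test SSH session and return to the original hadoopuser shell before continuing:

hadoopuser@hadoop1:~$ exit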

Step 4: Disable IPv6.
As root, edit the file /etc/sysctl.conf and append the following lines:

root@hadoop1:~# vi /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Save & exit.

root@hadoop1:~# sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1 
net.ipv6.conf.default.disable_ipv6 = 1 
net.ipv6.conf.lo.disable_ipv6 = 1

To verify that IPv6 has been disabled:

root@hadoop1:~# cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
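
A value of 1 means IPv6 is disabled. As a further optional check, you can confirm that no interface still holds an IPv6 address (this command should produce no output):

root@hadoop1:~# ip addr show | grep inet6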

Step 5: Add the Hadoop repository.

root@hadoop1:~# add-apt-repository ppa:hadoop-ubuntu/stable

Update and upgrade.

root@hadoop1:~# apt-get update && apt-get upgrade

Step 6: Install Hadoop.

root@hadoop1:~# apt-get install hadoop

Verify the hadoopuser details:

root@hadoop1:~# id hadoopuser
uid=1001(hadoopuser) gid=1002(hadoopgroup) groups=1002(hadoopgroup)

Add the hadoop user to the sudoers file so that it has root-level permissions. Open the file with visudo:

root@hadoop1:~# visudo

and add the following line:

hadoopuser ALL=(ALL:ALL) ALL
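
As an optional check that the sudoers entry took effect, list the user's sudo privileges:

root@hadoop1:~# sudo -l -U hadoopuser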

Set the environment in the hadoop user's .bashrc file as follows:

root@hadoop1:~# vi /home/hadoopuser/.bashrc

and add the following line:

# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hadoopuser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
# Add Hadoop bin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:/usr/lib/hadoop/bin/

# Set some aliases
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
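
These settings apply only to shells started from now on; to pick them up in an already-open hadoopuser session, reload the file:

hadoopuser@hadoop1:~$ source ~/.bashrc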

Step 7: Configure Hadoop.

root@hadoop1:~# chown -R hadoopuser:hadoopgroup /var/log/hadoop/
root@hadoop1:~# chmod -R 755 /var/log/hadoop/
root@hadoop1:~# cd /usr/lib/hadoop/conf/
root@hadoop1:/usr/lib/hadoop/conf# ls -ltr
total 76 
-rw-r--r-- 1 root hadoop  382 Mar 24  2012 taskcontroller.cfg 
-rw-r--r-- 1 root hadoop 1195 Mar 24  2012 ssl-server.xml.example 
-rw-r--r-- 1 root hadoop 1243 Mar 24  2012 ssl-client.xml.example 
-rw-r--r-- 1 root hadoop   10 Mar 24  2012 slaves 
-rw-r--r-- 1 root hadoop   10 Mar 24  2012 masters 
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 mapred-site.xml 
-rw-r--r-- 1 root hadoop 2033 Mar 24  2012 mapred-queue-acls.xml 
-rw-r--r-- 1 root hadoop 4441 Mar 24  2012 log4j.properties 
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 hdfs-site.xml 
-rw-r--r-- 1 root hadoop 4644 Mar 24  2012 hadoop-policy.xml 
-rw-r--r-- 1 root hadoop 1488 Mar 24  2012 hadoop-metrics2.properties 
-rw-r--r-- 1 root hadoop 2237 Mar 24  2012 hadoop-env.sh 
-rw-r--r-- 1 root hadoop  327 Mar 24  2012 fair-scheduler.xml 
-rw-r--r-- 1 root hadoop  178 Mar 24  2012 core-site.xml 
-rw-r--r-- 1 root hadoop  535 Mar 24  2012 configuration.xsl 
-rw-r--r-- 1 root hadoop 7457 Mar 24  2012 capacity-scheduler.xml 
root@hadoop1:/usr/lib/hadoop/conf#

But before we start using Hadoop, we need to modify several files in this conf directory.

hadoop-env.sh

Replace the JAVA_HOME line with the line below (matching the Oracle Java 6 installation from Step 1):

export JAVA_HOME=/usr/lib/jvm/java-6-oracle

core-site.xml

Create a base directory for temporary files:

root@hadoop1:/usr/lib/hadoop/conf# mkdir /home/hadoopuser/tmp
root@hadoop1:/usr/lib/hadoop/conf# chown hadoopuser:hadoopgroup /home/hadoopuser/tmp/
root@hadoop1:/usr/lib/hadoop/conf# chmod 755 /home/hadoopuser/tmp/
root@hadoop1:/usr/lib/hadoop/conf#

Replace the original contents with:

root@hadoop1:/usr/lib/hadoop/conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoopuser/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

mapred-site.xml

root@hadoop1:/usr/lib/hadoop/conf# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>

hdfs-site.xml

root@hadoop1:/usr/lib/hadoop/conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
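
Before moving on, it is worth checking that none of the three edited files has an XML syntax error. If the libxml2-utils package is installed, xmllint provides a quick check (no output means the files are well-formed):

root@hadoop1:/usr/lib/hadoop/conf# xmllint --noout core-site.xml mapred-site.xml hdfs-site.xml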

Next, change the ownership of the Hadoop installation and configuration directories to the hadoop user:

root@hadoop1:/usr/lib# chown -R hadoopuser:hadoopgroup /usr/lib/hadoop/
root@hadoop1:/usr/lib# ls -ltr hadoop/
total 16 
lrwxrwxrwx 1 hadoopuser hadoopgroup   15 Apr 24  2012 pids -> /var/run/hadoop 
lrwxrwxrwx 1 hadoopuser hadoopgroup   15 Apr 24  2012 logs -> /var/log/hadoop 
lrwxrwxrwx 1 hadoopuser hadoopgroup   41 Apr 24  2012 hadoop-tools-1.0.2.jar -> ../../share/hadoop/hadoop-tools-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-test-1.0.2.jar -> ../../share/hadoop/hadoop-test-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   44 Apr 24  2012 hadoop-examples-1.0.2.jar -> ../../share/hadoop/hadoop-examples-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-core.jar -> ../../share/hadoop/hadoop-core-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   40 Apr 24  2012 hadoop-core-1.0.2.jar -> ../../share/hadoop/hadoop-core-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   39 Apr 24  2012 hadoop-ant-1.0.2.jar -> ../../share/hadoop/hadoop-ant-1.0.2.jar 
lrwxrwxrwx 1 hadoopuser hadoopgroup   26 Apr 24  2012 contrib -> ../../share/hadoop/contrib 
lrwxrwxrwx 1 hadoopuser hadoopgroup   16 Apr 24  2012 conf -> /etc/hadoop/conf 
drwxr-xr-x 9 hadoopuser hadoopgroup 4096 Dec 15 05:16 webapps 
drwxr-xr-x 2 hadoopuser hadoopgroup 4096 Dec 15 05:16 libexec 
drwxr-xr-x 2 hadoopuser hadoopgroup 4096 Dec 15 05:16 bin 
drwxr-xr-x 3 hadoopuser hadoopgroup 4096 Dec 15 05:16 lib 
root@hadoop1:/usr/lib#
root@hadoop1:/etc/hadoop/conf# chown -R hadoopuser:hadoopgroup /etc/hadoop/
root@hadoop1:/etc/hadoop/conf#
root@hadoop1:~# su - hadoopuser
hadoopuser@hadoop1:~$ mkdir hadoop

Commands To Manage Hadoop Services

  • start-dfs.sh – Starts the Hadoop DFS daemons, the namenode and datanodes. Use this before start-mapred.sh.
  • stop-dfs.sh – Stops the Hadoop DFS daemons.
  • start-mapred.sh – Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
  • stop-mapred.sh – Stops the Hadoop Map/Reduce daemons.
  • start-all.sh – Starts all Hadoop daemons: the namenode, datanodes, the jobtracker and tasktrackers. Deprecated; use start-dfs.sh then start-mapred.sh instead.
  • stop-all.sh – Stops all Hadoop daemons. Deprecated; use stop-mapred.sh then stop-dfs.sh instead.

Format The Hadoop File System

To format the Hadoop file system:

root@hadoop1:/usr/lib/hadoop/conf# su - hadoopuser
hadoopuser@hadoop1:~$ hadoop namenode -format
13/12/15 19:53:16 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG:   host = java.net.UnknownHostException: hadoop1.example.com: hadoop1.example.com 
STARTUP_MSG:   args = [-format] 
STARTUP_MSG:   version = 1.0.2 
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012 
************************************************************/ 
13/12/15 19:53:17 INFO util.GSet: VM type       = 64-bit 
13/12/15 19:53:17 INFO util.GSet: 2% max memory = 19.33375 MB 
13/12/15 19:53:17 INFO util.GSet: capacity      = 2^21 = 2097152 entries 
13/12/15 19:53:17 INFO util.GSet: recommended=2097152, actual=2097152 
13/12/15 19:53:37 INFO namenode.FSNamesystem: fsOwner=hadoopuser 
13/12/15 19:53:37 INFO namenode.FSNamesystem: supergroup=supergroup 
13/12/15 19:53:37 INFO namenode.FSNamesystem: isPermissionEnabled=true 
13/12/15 19:53:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100 
13/12/15 19:53:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 
13/12/15 19:53:37 INFO namenode.NameNode: Caching file names occuring more than 10 times 
13/12/15 19:53:58 INFO common.Storage: Image file of size 116 saved in 0 seconds. 
13/12/15 19:53:58 INFO common.Storage: Storage directory /home/hadoopuser/tmp/dfs/name has been successfully formatted. 
13/12/15 19:53:58 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: hadoop1.example.com: hadoop1.example.com 
************************************************************/
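
Note the java.net.UnknownHostException in the messages above: the format still completes, but it means the hostname hadoop1.example.com does not resolve. A common fix is to map the hostname to the loopback address in /etc/hosts (adjust the name to match your own machine):

root@hadoop1:~# echo "127.0.0.1   hadoop1.example.com hadoop1" >> /etc/hosts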

Starting the Hadoop Cluster using start-all.sh:

hadoopuser@hadoop1:~$ start-all.sh
Warning: $HADOOP_HOME is deprecated. 
starting namenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-namenode-hadoop1.example.com.out 
localhost: starting datanode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-datanode-hadoop1.example.com.out 
localhost: starting secondarynamenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-secondarynamenode-hadoop1.example.com.out 
starting jobtracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-jobtracker-hadoop1.example.com.out 
localhost: starting tasktracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-hadoopuser-tasktracker-hadoop1.example.com.out 
hadoopuser@hadoop1:~$

Or start the DFS and MapReduce daemons separately:

start-dfs.sh  
start-mapred.sh

To check whether Hadoop is running, use the jps command:

hadoopuser@hadoop1:~$ jps
35401 NameNode 
35710 JobTracker 
35627 SecondaryNameNode 
35928 Jps 
hadoopuser@hadoop1:~$
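
Note that on a healthy single-node setup, jps should also list the DataNode and TaskTracker daemons; if either is missing, check the corresponding log file under /var/log/hadoop/ before proceeding. As a final smoke test, you can run the wordcount job from the examples jar that ships with this release (hadoop-examples-1.0.2.jar, visible in the /usr/lib/hadoop listing earlier). The HDFS paths below are just illustrative choices:

hadoopuser@hadoop1:~$ hadoop fs -mkdir /user/hadoopuser/input
hadoopuser@hadoop1:~$ hadoop fs -put /etc/hosts /user/hadoopuser/input
hadoopuser@hadoop1:~$ hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.2.jar wordcount /user/hadoopuser/input /user/hadoopuser/output
hadoopuser@hadoop1:~$ hadoop fs -cat /user/hadoopuser/output/part-*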