In this blog, we will describe the steps and configurations required to set up a distributed multi-node Apache Hadoop cluster.
Prerequisites
1. Single-node Hadoop cluster
If you have not configured a single-node Hadoop cluster yet, use the link below to configure one first.
How to install single node hadoop cluster
After configuring the single-node Hadoop cluster, clone it to set up the multi-node Hadoop cluster.
Cloning steps:
a) Right-click your Masternode (single-node cluster) virtual machine; you will see a screen like the one below.
b) Select the Clone option.
c) Give a new name to the cloned machine.
Make sure you have selected "Reinitialize the MAC address of all network cards".
d) Select Full clone.
Now click the Clone button; it will take some time to create the new virtual machine (Datanode).
Repeat the same process to create the second Datanode.
Note: Reinitialize the MAC address while cloning.
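Each clone starts out with the Masternode's hostname and IP address, so give every cloned Datanode its own identity before going further. A minimal sketch, assuming a CentOS/RHEL 6-style guest (the same family as the service and iptables commands used later in this post); file and device names (eth0) may differ on your distribution:
# on Datanode 1, for example
sudo vi /etc/sysconfig/network                      # set HOSTNAME=dn1.mycluster.com
sudo vi /etc/sysconfig/network-scripts/ifcfg-eth0   # set BOOTPROTO=static and IPADDR=192.168.10.101
sudo reboot                                         # or restart the network service so the changes take effect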
2. Networking
Networking plays an important role here. Before merging the single-node clusters into a multi-node cluster, we need to make sure that all the nodes can ping each other (they must be on the same network/hub so that every machine can talk to the others).
In this blog, the network configuration for the Hadoop cluster is as follows:
IP address of the Masternode (Namenode) – 192.168.10.100
IP address of Datanode 1 (slave node) – 192.168.10.101
IP address of Datanode 2 (slave node) – 192.168.10.102
Check the communication between the master and the slaves:
a) Ping by IP address:
ping 192.168.10.101
ping 192.168.10.102
b) If they respond, ping them by hostname:
ping dn1.mycluster.com
ping dn2.mycluster.com
Note: Verify pinging from the slave nodes as well, to check whether they can communicate with the Master node. If you receive replies, the nodes can communicate with each other.
c) Verify passwordless SSH login:
ssh dn1.mycluster.com
ssh dn2.mycluster.com
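If passwordless login does not work yet, here is a minimal sketch to set it up from the Masternode, assuming the cluster user is named hadoop (as elsewhere in this post):
ssh-keygen -t rsa                      # press Enter at each prompt to accept the defaults
ssh-copy-id hadoop@dn1.mycluster.com   # copies the public key to Datanode 1
ssh-copy-id hadoop@dn2.mycluster.com   # copies the public key to Datanode 2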
d) Stop iptables on each node (Namenode, Datanode1, Datanode2):
sudo service iptables stop
or
service iptables stop
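Note: on newer systemd-based distributions (CentOS/RHEL 7 and later) the firewall is usually managed by firewalld rather than the iptables service; in that case the equivalent commands would be:
sudo systemctl stop firewalld
sudo systemctl disable firewalld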
Now come to your Master node (Namenode).
Namenode Configuration
Before configuring the Master node (Namenode), make sure you have configured the /etc/hosts file.
To configure the /etc/hosts file:
sudo vi /etc/hosts

192.168.10.100 namenode.mycluster.com
192.168.10.101 dn1.mycluster.com
192.168.10.102 dn2.mycluster.com
Now follow the steps below to make changes on each machine (node).
These are the changes that have to be made on the Master node (Namenode).
1) Log in to your Master node (Namenode) and move to the Hadoop configuration directory:
cd hadoop-2.6.0/etc/hadoop/
2) Open core-site.xml and add the following:
vi core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.mycluster.com:9000</value>
  </property>
</configuration>
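Note: fs.default.name still works in Hadoop 2.x but is deprecated in favour of fs.defaultFS. If you prefer the newer property name, the equivalent entry is:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.mycluster.com:9000</value>
</property>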
3) Open hdfs-site.xml:
vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>
Note: In <value>/home/hadoop/hadoop/namenode</value>, /home/hadoop is the home directory of the hadoop user; replace it with your own user's home directory. The rest of the path is the directory we create later in this post.
4) Open mapred-site.xml:
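Note: a fresh Hadoop 2.6.0 installation ships only mapred-site.xml.template, not mapred-site.xml. If the file does not exist on your machine yet, create it from the template first:
cp mapred-site.xml.template mapred-site.xml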
vi mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5) Open yarn-site.xml and add these entries:
vi yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>namenode.mycluster.com:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>namenode.mycluster.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>namenode.mycluster.com:8050</value>
  </property>
</configuration>
See the screenshot below.
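One more thing worth checking on the Master node: the start-dfs.sh and start-yarn.sh scripts used later start the slave daemons only on the hosts listed in the slaves file inside this same configuration directory. If your copy still contains only localhost (the default), list the Datanode hostnames instead, one per line:
vi slaves

dn1.mycluster.com
dn2.mycluster.com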
6) Ensure the SSH service is running by typing the command below:
sudo service sshd start
DataNode Configuration
Before configuring the Datanode, make sure you have configured the /etc/hosts file.
To configure the /etc/hosts file:
sudo vi /etc/hosts

192.168.10.100 namenode.mycluster.com
192.168.10.101 dn1.mycluster.com
192.168.10.102 dn2.mycluster.com
Follow these steps to update the Datanode.
1) Log in to your Datanode and move to the Hadoop configuration directory:
cd hadoop-2.6.0/etc/hadoop/
2) Open core-site.xml and add the following:
vi core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.mycluster.com:9000</value>
  </property>
</configuration>
3) Open hdfs-site.xml:
sudo vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>
Note: In <value>/home/hadoop/hadoop/datanode</value>, /home/hadoop is the home directory of the hadoop user; replace it with your own user's home directory. The rest of the path is the directory we create later in this post.
4) Open yarn-site.xml:
vi yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>namenode.mycluster.com:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>namenode.mycluster.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>namenode.mycluster.com:8050</value>
  </property>
</configuration>
5) Open mapred-site.xml:
vi mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
6) Ensure the SSH service is running by typing the command below:
sudo service sshd start
Note: Repeat the same steps on every Datanode.
Create the /home/hadoop/hadoop/namenode directory on the Master node (Namenode) and the /home/hadoop/hadoop/datanode directory on both Datanodes (slave nodes):
mkdir -p /home/hadoop/hadoop/namenode (on the Master node only)
mkdir -p /home/hadoop/hadoop/datanode (on the slave nodes only)
Note: If these directories already exist, remove them and create fresh ones using the commands above.
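For example, on the Master node (be careful: this wipes any existing NameNode metadata in that directory):
rm -rf /home/hadoop/hadoop/namenode
mkdir -p /home/hadoop/hadoop/namenode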
Log in to your Masternode (Namenode) and follow these steps to start your Hadoop cluster.
To start all the daemons, follow the steps below:
1) Format the NameNode first:
hadoop namenode -format
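Note: the hadoop namenode -format form is deprecated in Hadoop 2.x; the preferred equivalent, which you can use instead, is:
hdfs namenode -format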
2) Start the DFS daemons on the Namenode.
Type the command below to start the DFS daemons:
./start-dfs.sh
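The start/stop scripts live in the sbin directory of the Hadoop installation, so if they are not already on your PATH, run them from there. For example, assuming Hadoop was extracted into the hadoop user's home directory as in this post:
cd ~/hadoop-2.6.0/sbin
./start-dfs.sh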
3) Type jps to see the running daemons:
jps
4) Start the YARN and history server daemons:
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
jps
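On the Masternode you would typically expect jps to list NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer (along with Jps itself).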
You can also use start-all.sh to start all the daemons at once:
start-all.sh
5) Log in to your Datanode and verify the running daemons:
jps
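On a Datanode you would typically expect jps to list DataNode and NodeManager (along with Jps itself).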
You can also check the other Datanode in the same way.
Here is a screenshot where you can see the running daemons on each node.
6) Verify the live slave nodes with a hadoop dfsadmin report:
hadoop dfsadmin -report
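Note: hadoop dfsadmin is deprecated in Hadoop 2.x; the equivalent hdfs command is:
hdfs dfsadmin -report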
Now open your browser and enter the address below in the URL bar:
192.168.10.100:50070
You will see a screen like the one below.
This is your GUI (a web server provided by Hadoop) for the Hadoop cluster.
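You can also open the YARN ResourceManager web UI, which by default listens on port 8088:
192.168.10.100:8088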