Multi Node Cluster Setup
This tutorial describes how to set up a Hadoop Multi Node Cluster. A Multi Node Cluster in Hadoop contains two or more DataNodes in a distributed Hadoop environment. Such clusters are used in practice by organizations to store and analyze petabytes and exabytes of data.
Recommended Platform
OS: Linux is supported as a development and production platform. You can use Ubuntu 14.04, 16.04, or later (other Linux flavors such as CentOS and Red Hat also work).
Hadoop: Cloudera Distribution for Apache Hadoop CDH5.x (you can use Apache Hadoop 2.x)
Install Hadoop on Master
Let us start by installing Hadoop on the master node in distributed mode.
Prerequisites for Hadoop Multi Node Cluster Setup
1. Add Entries in hosts file
Edit the hosts file and add entries for the master and all slave nodes.
Command
sudo vi /etc/hosts
MASTER-IP master
SLAVE01-IP slave01
SLAVE02-IP slave02
Note
In place of MASTER-IP, SLAVE01-IP, and SLAVE02-IP, put the corresponding IP addresses.
Example
172.26.110.100 master
172.26.110.101 slave01
172.26.110.102 slave02
2. Install Java 8 (Recommended Oracle Java)
Hadoop requires a working Java installation; for Hadoop 2.x, Java 7 or later is needed, and Java 8 is recommended for running Hadoop.
2.1 Install Python Software Properties
Command
sudo apt-get install python-software-properties
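On newer Ubuntu releases (16.04 and later) this package has been replaced; if the command above fails, installing software-properties-common instead provides the add-apt-repository tool used in the next step:
Command
sudo apt-get install software-properties-common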
2.2 Add Repository
Command
sudo add-apt-repository ppa:webupd8team/java
2.3 Update the source list
Command
sudo apt-get update
2.4 Install Java
Command
sudo apt-get install oracle-java8-installer
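To confirm that Java installed correctly, you can check the reported version (the exact version string depends on the build you installed):
Command
java -version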
3. Configure SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
3.1 Install Open SSH Server-Client
Command
sudo apt-get install openssh-server openssh-client
3.2 Generate KeyPairs
Command
ssh-keygen -t rsa -P ""
3.3 Configure password-less SSH
3.3.1 Copy the generated SSH public key to the master node's authorized keys.
Command
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
3.3.2 Copy the master node's SSH public key to each slave's authorized keys.
Command
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ashok@slave01
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ashok@slave02
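If password-less SSH still prompts for a password afterwards, a common cause is over-permissive file modes on the target node; assuming the default ~/.ssh layout, the following fixes them:
Command
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys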
3.4 Check by SSH to all the Slaves
Command
ssh slave01
ssh slave02
2. Install Hadoop
1. Download Hadoop
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.3.2.tar.gz
Note
You can download any Hadoop 2.x release. Here I am using CDH, Cloudera's 100% open-source platform distribution.
2. Untar the tarball
Command
tar xzf hadoop-2.5.0-cdh5.3.2.tar.gz
3. Hadoop multi-node cluster setup Configuration
1. Edit .bashrc
Edit the .bashrc file located in the user's home directory and add the following parameters.
Command
vi .bashrc
export HADOOP_PREFIX="/home/ashok/hadoop-2.5.0-cdh5.3.2"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
Note
After the above step, restart the terminal so that all the environment variables take effect, or execute the source command:
Command
source .bashrc
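To verify that the variables took effect, you can check that the hadoop binary is now on the PATH; it should print the installed Hadoop/CDH version:
Command
hadoop version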
2. Edit hadoop-env.sh
hadoop-env.sh contains the environment variables used by the scripts that run Hadoop, such as the Java home path. Edit the configuration file hadoop-env.sh (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME.
Command
vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
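If you are unsure of the Java installation path on your machine, you can resolve it from the java binary; the result typically ends in /jre/bin/java (or /bin/java), and everything before that is the JAVA_HOME to use:
Command
readlink -f $(which java)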
3. Edit core-site.xml
core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings, that are common to HDFS and MapReduce.
Edit the configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command
vi core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ashok/hdata</value>
  </property>
</configuration>
Note
Here /home/ashok/hdata is a sample location; specify a location where you have read/write privileges.
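If the directory does not already exist, create it before starting the daemons (do the same on every slave once the setup is copied over):
Command
mkdir -p /home/ashok/hdata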
4. Edit hdfs-site.xml
hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
Edit the configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries. The replication factor is set to 2 here because this cluster has two DataNodes.
Command
vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
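The block size mentioned above can also be overridden here by adding a dfs.blocksize property inside the same <configuration> element; the 128 MB value below is only an illustration of the syntax, not a required setting:
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>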
5. Edit mapred-site.xml
mapred-site.xml contains configuration settings of MapReduce applications, such as the number of JVMs that can run in parallel, the sizes of the mapper and reducer processes, the CPU cores available to a process, etc.
In some cases the mapred-site.xml file is not present; if so, create it from the mapred-site.xml template (see the copy command below). Edit the configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
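Assuming the template shipped with the tarball, the file can be created by copying it from within HADOOP_HOME/etc/hadoop:
Command
cp mapred-site.xml.template mapred-site.xml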
Command
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
6. Edit yarn-site.xml
yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as application memory-management sizes, the operations applied to programs and algorithms, etc. Edit the configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command
vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
</configuration>
7. Edit slaves
Edit the configuration file slaves (located in HADOOP_HOME/etc/hadoop) and add the following entries:
slave01
slave02
Hadoop is now set up on the master; next, set up Hadoop on all the slaves.
Install Hadoop On Slaves
1. Setup Prerequisites on all the slaves
Run the following steps on all the slaves:
1. Add Entries in hosts file
2. Install Java 8 (Recommended Oracle Java)
2. Copy configured setups from master to all the slaves
2.1. Create tarball of configured setup
Command
tar czf hadoop.tar.gz hadoop-2.5.0-cdh5.3.2
Note
Run this command on Master
2.2. Copy the configured tarball on all the slaves
Command
scp hadoop.tar.gz slave01:~
scp hadoop.tar.gz slave02:~
Note
Run this command on Master
2.3. Un-tar configured Hadoop setup on all the slaves
Command
tar xvzf hadoop.tar.gz
Note
Run this command on all slaves.
Hadoop is now set up on all the slaves. Next, start the cluster.
4. Start the Hadoop Cluster
Let us now learn how to start the Hadoop cluster.
4.1. Format the name node
Command
bin/hdfs namenode -format
Note
- Run this command on Master
- Format the NameNode only once, when you first install Hadoop; running the format again will delete all the data in HDFS.
4.2. Start HDFS Services
Command
sbin/start-dfs.sh
Note
Run this command on Master
4.3. Start YARN Services
Command
sbin/start-yarn.sh
Note
Run this command on Master
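Apart from jps, the daemons can also be confirmed through the web UIs; assuming the default Hadoop 2.x ports and the hostnames configured above:
NameNode web UI: http://master:50070
ResourceManager web UI: http://master:8088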
4.4. Check for Hadoop services
4.4.1. Check daemons on Master
Command
jps
NameNode
ResourceManager
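A SecondaryNameNode typically appears on the master as well, since start-dfs.sh starts it there by default. An illustrative jps listing (process IDs will differ) looks like:
4310 NameNode
4571 SecondaryNameNode
4857 ResourceManager
5120 Jps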
4.4.2. Check daemons on Slaves
Command
jps
DataNode
NodeManager
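You can also confirm from the master that both DataNodes registered with the NameNode by printing an HDFS report (run from the Hadoop directory on the master):
Command
bin/hdfs dfsadmin -report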
5. Stop The Hadoop Cluster
Let us now see how to stop the Hadoop cluster.
5.1. Stop YARN Services
Command
sbin/stop-yarn.sh
Note
Run this command on Master
5.2. Stop HDFS Services
Command
sbin/stop-dfs.sh
Note
Run this command on Master