Multi Node Cluster Setup
This tutorial describes how to set up a Hadoop Multi Node Cluster. A Multi Node Cluster in Hadoop contains two or more DataNodes in a distributed Hadoop environment. Such clusters are used in practice by organizations to store and analyze petabytes and exabytes of data.
Recommended Platform
OS: Linux is supported as a development and production platform. You can use Ubuntu 14.04, 16.04, or later (other Linux flavors such as CentOS and Red Hat also work).
Hadoop: Cloudera Distribution for Apache Hadoop CDH5.x (you can use Apache Hadoop 2.x)
Install Hadoop on Master
Let us start by installing Hadoop on the master node in distributed mode.
Prerequisites for Hadoop Multi Node Cluster Setup
1. Add Entries in hosts file
Edit the hosts file and add entries for the master and all slave nodes.
Command
sudo vi /etc/hosts
MASTER-IP master
SLAVE01-IP slave01
SLAVE02-IP slave02
Note
In place of MASTER-IP, SLAVE01-IP, and SLAVE02-IP, put the corresponding IP addresses.
Example
172.26.110.100 master
172.26.110.101 slave01
172.26.110.102 slave02
2. Install Java 8 (Recommended Oracle Java)
Hadoop requires a working Java installation; for Hadoop 2.x, Java 7 or later is needed, and Java 8 is recommended for running Hadoop.
2.1 Install Python Software Properties
Command
sudo apt-get install python-software-properties
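On newer Ubuntu releases (16.04 and later) this package has been replaced; if the command above fails, installing software-properties-common instead provides the add-apt-repository tool used in the next step:
Command
sudo apt-get install software-properties-common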
2.2 Add Repository
Command
sudo add-apt-repository ppa:webupd8team/java
2.3 Update the source list
Command
sudo apt-get update
2.4 Install Java
Command
sudo apt-get install oracle-java8-installer
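To confirm that Java installed correctly, you can check the reported version (the exact version string depends on the build you installed):
Command
java -version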
3. Configure SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
3.1 Install Open SSH Server-Client
Command
sudo apt-get install openssh-server openssh-client
3.2 Generate KeyPairs
Command
ssh-keygen -t rsa -P ""
3.3 Configure password-less SSH
3.3.1 Copy the generated SSH public key to the master node's authorized keys.
Command
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
3.3.2 Copy the master node's SSH public key to each slave's authorized keys.
Command
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ashok@slave01
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ashok@slave02
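If password-less SSH still prompts for a password afterwards, a common cause is over-permissive file modes on the target node; assuming the default ~/.ssh layout, the following fixes them:
Command
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys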
3.4 Check by SSH to all the Slaves
Command
ssh slave01
ssh slave02
2. Install Hadoop
1. Download Hadoop
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.3.2.tar.gz
Note
You can download any Hadoop 2.x release. Here I am using CDH, Cloudera's 100% open-source platform distribution.
2. Untar the tarball
Command
tar xzf hadoop-2.5.0-cdh5.3.2.tar.gz
3. Hadoop multi-node cluster setup Configuration
1. Edit .bashrc
Edit the .bashrc file located in the user's home directory and add the following parameters.
Command
vi .bashrc
export HADOOP_PREFIX="/home/ashok/hadoop-2.5.0-cdh5.3.2"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
Note
After the above step, restart the terminal so that all the environment variables take effect, or execute the source command:
Command
source .bashrc
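To verify that the variables took effect, you can check that the hadoop binary is now on the PATH; it should print the installed Hadoop/CDH version:
Command
hadoop version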
2. Edit hadoop-env.sh
hadoop-env.sh contains the environment variables used by the scripts that run Hadoop, such as the Java home path. Edit the configuration file hadoop-env.sh (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME.
Command
vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
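If you are unsure of the Java installation path on your machine, you can resolve it from the java binary; the result typically ends in /jre/bin/java (or /bin/java), and everything before that is the JAVA_HOME to use:
Command
readlink -f $(which java)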
3. Edit core-site.xml
core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings, that are common to HDFS and MapReduce.
Edit the configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command
vi core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ashok/hdata</value>
  </property>
</configuration>
Note
Here /home/ashok/hdata is a sample location; specify a location where you have read/write privileges.
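If the directory does not already exist, create it before starting the daemons (do the same on every slave once the setup is copied over):
Command
mkdir -p /home/ashok/hdata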
4. Edit hdfs-site.xml
hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
Edit the configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries. The replication factor is set to 2 here because this cluster has two DataNodes.
Command
vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
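The block size mentioned above can also be overridden here by adding a dfs.blocksize property inside the same <configuration> element; the 128 MB value below is only an illustration of the syntax, not a required setting:
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>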
5. Edit mapred-site.xml
mapred-site.xml contains configuration settings of MapReduce applications, such as the number of JVMs that can run in parallel, the sizes of the mapper and reducer processes, the CPU cores available to a process, etc.
In some cases the mapred-site.xml file is not present; if so, create it from the mapred-site.xml template (see the copy command below). Edit the configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
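Assuming the template shipped with the tarball, the file can be created by copying it from within HADOOP_HOME/etc/hadoop:
Command
cp mapred-site.xml.template mapred-site.xml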
Command
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
6. Edit yarn-site.xml
yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as application memory-management sizes, the operations applied to programs and algorithms, etc. Edit the configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.
Command
vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
</configuration>
7. Edit slaves
Edit the configuration file slaves (located in HADOOP_HOME/etc/hadoop) and add the following entries:
slave01
slave02
Hadoop is now set up on the master; next, set up Hadoop on all the slaves.
Install Hadoop On Slaves
1. Setup Prerequisites on all the slaves
Run the following steps on all the slaves:
1. Add Entries in hosts file
2. Install Java 8 (Recommended Oracle Java)
2. Copy configured setups from master to all the slaves
2.1. Create tarball of configured setup
Command
tar czf hadoop.tar.gz hadoop-2.5.0-cdh5.3.2
Note
Run this command on Master
2.2. Copy the configured tarball on all the slaves
Command
scp hadoop.tar.gz slave01:~
scp hadoop.tar.gz slave02:~
Note
Run this command on Master
2.3. Un-tar configured Hadoop setup on all the slaves
Command
tar xvzf hadoop.tar.gz
Note
Run this command on all slaves.
Hadoop is now set up on all the slaves. Next, start the cluster.
4. Start the Hadoop Cluster
Let us now learn how to start the Hadoop cluster.
4.1. Format the name node
Command
bin/hdfs namenode -format
Note
- Run this command on Master
- Format the NameNode only once, when you first install Hadoop; running the format again will delete all the data in HDFS.
4.2. Start HDFS Services
Command
sbin/start-dfs.sh
Note
Run this command on Master
4.3. Start YARN Services
Command
sbin/start-yarn.sh
Note
Run this command on Master
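Apart from jps, the daemons can also be confirmed through the web UIs; assuming the default Hadoop 2.x ports and the hostnames configured above:
NameNode web UI: http://master:50070
ResourceManager web UI: http://master:8088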
4.4. Check for Hadoop services
4.4.1. Check daemons on Master
Command
jps
NameNode
ResourceManager
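A SecondaryNameNode typically appears on the master as well, since start-dfs.sh starts it there by default. An illustrative jps listing (process IDs will differ) looks like:
4310 NameNode
4571 SecondaryNameNode
4857 ResourceManager
5120 Jps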
4.4.2. Check daemons on Slaves
Command
jps
DataNode
NodeManager
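You can also confirm from the master that both DataNodes registered with the NameNode by printing an HDFS report (run from the Hadoop directory on the master):
Command
bin/hdfs dfsadmin -report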
5. Stop The Hadoop Cluster
Let us now see how to stop the Hadoop cluster.
5.1. Stop YARN Services
Command
sbin/stop-yarn.sh
Note
Run this command on Master
5.2. Stop HDFS Services
Command
sbin/stop-dfs.sh
Note
Run this command on Master