
Apache Hadoop 3.x installation on Ubuntu (multi-node cluster).


This document explains, step by step, how to install Apache Hadoop 3.1.1 as a cluster with one master node (namenode) and 3 worker nodes (datanodes) on Ubuntu.

Below are the 4 nodes and their IP addresses I will be referring to here.

192.168.1.100      namenode
192.168.1.141      datanode1
192.168.1.113      datanode2
192.168.1.118      datanode3

And my login user is “ubuntu”.

1. Apache Hadoop Installation

  1. Update the source list of Ubuntu
sudo apt-get update

2. Install SSH

sudo apt-get install ssh

3. Set up passwordless login between the namenode and all datanodes in the cluster.

The master node will use an SSH connection with key-pair authentication to connect to the other nodes and manage the cluster.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

The ssh-keygen command creates the files below.

ubuntu@namenode:~$ ls -lrt .ssh/
-rw-r--r-- 1 ubuntu ubuntu  397 Dec 9 00:17 id_rsa.pub
-rw------- 1 ubuntu ubuntu 1679 Dec 9 00:17 id_rsa

Append id_rsa.pub to authorized_keys under the ~/.ssh folder.

cat id_rsa.pub >> ~/.ssh/authorized_keys
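If authorized_keys did not exist before, SSH may reject it unless the permissions on it and on the .ssh folder are strict, so it does not hurt to tighten them:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys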

Copy authorized_keys to all data nodes.

scp .ssh/authorized_keys datanode1:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode2:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode3:/home/ubuntu/.ssh/authorized_keys
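Once the datanode hostnames resolve (see the /etc/hosts step below), you can confirm passwordless login from the namenode; each command should print the remote hostname without asking for a password:

ssh datanode1 hostname
ssh datanode2 hostname
ssh datanode3 hostname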

4. Add all our nodes to /etc/hosts.

sudo vi /etc/hosts

192.168.1.100 namenode.socal.rr.com namenode
192.168.1.141 datanode1.socal.rr.com datanode1
192.168.1.113 datanode2.socal.rr.com datanode2
192.168.1.118 datanode3.socal.rr.com datanode3
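You can verify that the names resolve with getent, for example:

getent hosts datanode1

which should print the line you just added for 192.168.1.141.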

5. Install JDK1.8 on all 4 nodes

sudo apt-get -y install openjdk-8-jdk-headless

After the JDK install, check that it installed successfully by running “java -version”.
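The output should report version 1.8. This install also determines the JAVA_HOME path used later in hadoop-env.sh; on Ubuntu it can typically be found with:

readlink -f $(which java)

which usually prints /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java, i.e. JAVA_HOME is /usr/lib/jvm/java-8-openjdk-amd64.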

6. Apache Hadoop installation version 3.1.1 on all 4 nodes

Download Hadoop 3.1.1 using the wget command.

wget http://apache.cs.utah.edu/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

Once your download is complete, extract the archive using tar, a file archiving tool for Ubuntu, and rename the folder to hadoop.

tar -xzf hadoop-3.1.1.tar.gz
mv hadoop-3.1.1 hadoop

7. Apache Hadoop configuration – Set up environment variables.

Add the Hadoop environment variables to the .bashrc file. Open the file in the vi editor and add the variables below.

vi ~/.bashrc

export HADOOP_HOME="/home/ubuntu/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}

Now load the environment variables into the current session.

source ~/.bashrc
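A quick sanity check that the variables are in effect and the Hadoop binaries are on the PATH:

echo $HADOOP_HOME
hadoop version

hadoop version should report 3.1.1.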

2. Configuring the Hadoop master node and all worker nodes

Make the configurations below on the namenode and on all 3 datanodes.

  1. Update hadoop-env.sh

Edit the ~/hadoop/etc/hadoop/hadoop-env.sh file and add JAVA_HOME.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. Update core-site.xml

Edit ~/hadoop/etc/hadoop/core-site.xml:


    
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.100:9000</value>
    </property>
</configuration>

3. Update hdfs-site.xml

Edit ~/hadoop/etc/hadoop/hdfs-site.xml:


    
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>

4. Update yarn-site.xml

Edit ~/hadoop/etc/hadoop/yarn-site.xml:


    
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.1.100</value>
    </property>
</configuration>

5. Update mapred-site.xml 

Edit ~/hadoop/etc/hadoop/mapred-site.xml:

[Note: This configuration is required only on the namenode; however, it does no harm to configure it on the datanodes as well.]


    
<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>192.168.1.100:54311</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>


6. Create data folder

Create the data folder and change its ownership to the login user. I’ve logged in as the ubuntu user, so you see ubuntu in the commands below.

sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown ubuntu:ubuntu -R /usr/local/hadoop/hdfs/data
chmod 700 /usr/local/hadoop/hdfs/data
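Since the same folder is needed on every node, one way to create it on the datanodes without logging in to each one is a small loop over SSH from the namenode (just a sketch; it assumes the ubuntu user can run sudo on the datanodes, and -t is passed so sudo can prompt for a password if it needs one):

for host in datanode1 datanode2 datanode3; do
  ssh -t $host "sudo mkdir -p /usr/local/hadoop/hdfs/data && sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs/data && chmod 700 /usr/local/hadoop/hdfs/data"
done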

3. Create masters and workers files

  1. Create masters file

The masters file is used by the startup scripts to identify the namenode, so edit ~/hadoop/etc/hadoop/masters and add your namenode IP.

192.168.1.100

2. Create workers file

The workers file is used by the startup scripts to identify the datanodes. Edit ~/hadoop/etc/hadoop/workers and add all your datanode IPs.

192.168.1.141
192.168.1.113
192.168.1.118
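Both files can also be written from the shell in one go, for example:

printf "%s\n" 192.168.1.100 > ~/hadoop/etc/hadoop/masters
printf "%s\n" 192.168.1.141 192.168.1.113 192.168.1.118 > ~/hadoop/etc/hadoop/workers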

This completes Apache Hadoop installation and Hadoop configuration.
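If you would rather not repeat the edits on every datanode by hand, one option (a sketch, assuming the directory layout above and the passwordless SSH set up earlier) is to copy the finished configuration files from the namenode to each datanode:

for host in datanode1 datanode2 datanode3; do
  scp ~/hadoop/etc/hadoop/*.xml ~/hadoop/etc/hadoop/hadoop-env.sh $host:/home/ubuntu/hadoop/etc/hadoop/
done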

4. Format HDFS and start cluster

  1. Format HDFS

HDFS needs to be formatted like any classical file system. On the namenode, run the following command:

hdfs namenode -format

Your Hadoop installation is now configured and ready to run.

2. Start cluster

Start HDFS by running the following script from the namenode:

start-dfs.sh

You should see the following lines

ubuntu@namenode:~$ start-dfs.sh
Starting namenodes on [namenode.socal.rr.com]
Starting datanodes
Starting secondary namenodes [namenode]
ubuntu@namenode:~$

jps on namenode should list the following

ubuntu@namenode:~$ jps
18978 SecondaryNameNode
19092 Jps
18686 NameNode

jps on datanodes should list the following

ubuntu@datanode1:~$ jps
14012 Jps
11242 DataNode
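Since yarn-site.xml and mapred-site.xml were configured above, you can also start YARN from the namenode; jps should then additionally show ResourceManager on the namenode and NodeManager on the datanodes, and the ResourceManager web UI is normally served on port 8088 (stop it again later with stop-yarn.sh before stopping HDFS):

start-yarn.sh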

And by accessing http://192.168.1.100:9870, you should see the namenode web UI shown below.

[Screenshot: Apache Hadoop namenode web UI]

3. Test by uploading a file to HDFS

Writing to and reading from HDFS is done with the command hdfs dfs. First, manually create your home directory; all other commands will use a path relative to this default home directory. (Note that ubuntu is my logged-in user. If you log on with a different user, use your user ID instead of ubuntu.)

hdfs dfs -mkdir -p /user/ubuntu/

Get a book file from the Gutenberg project:

wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt

Upload the downloaded file to HDFS using -put:

hdfs dfs -mkdir books
hdfs dfs -put alice.txt books

List the files on HDFS:

hdfs dfs -ls
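You can also read the file back to confirm the round trip, for example:

hdfs dfs -cat books/alice.txt | head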

There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation.

4. Stopping cluster

stop-dfs.sh

You should see the below output.

Stopping namenodes on [namenode.socal.rr.com]
Stopping datanodes
Stopping secondary namenodes [namenode]

References

https://hadoop.apache.org/
