Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
The Hadoop framework runs in an environment that provides distributed storage and computation across clusters of computers. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
1) A machine with the Ubuntu 14.04 LTS operating system installed.
2) Apache Hadoop 2.6.4 software (the hadoop-2.6.4.tar.gz archive).
By default, Hadoop is configured to run in a non-distributed or standalone mode, as a single Java process. There are no daemons running and everything runs in a single JVM instance. HDFS is not used.
Step 1 – Update. Open a terminal (CTRL + ALT + T) and type the following sudo command. It refreshes the package index, so it is advisable to run it before installing any package to make sure you get the latest available versions, even if you have not added or removed any Software Sources.
$ sudo apt-get update
Step 2 – Installing Java 7.
$ sudo apt-get install openjdk-7-jdk
Step 3 – Install the OpenSSH server. SSH is a cryptographic network protocol for operating network services securely over an unsecured network. Its best-known application is remote login to computer systems.
$ sudo apt-get install openssh-server
Step 4 – Create a group and a user. We will create a group, add a user to it, and then configure sudo permissions for that user. Here ‘hadoop’ is the group name and ‘hduser’ is a user in the group.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Step 5 – Configure the sudo permissions for ‘hduser’.
$ sudo visudo
On Ubuntu, visudo opens the sudoers file in the nano text editor by default, so you can edit the file directly.
Add the following line to grant the permissions.
hduser ALL=(ALL) ALL
Press CTRL + X to exit, enter Y to save the changes, and press Enter to confirm the file name.
Step 6 – Creating hadoop directory.
$ sudo mkdir /usr/local/hadoop
Step 7 – Change the ownership and permissions of the directory /usr/local/hadoop. Here ‘hduser’ is an Ubuntu username.
$ sudo chown -R hduser /usr/local/hadoop
$ sudo chmod -R 755 /usr/local/hadoop
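To confirm that steps 6 and 7 took effect, the owner and mode of the directory can be printed. This is only a sketch: show_owner_mode is a hypothetical helper name, and it assumes GNU stat, which Ubuntu ships by default.

```shell
# Hypothetical helper: print "<owner> <octal mode>" for a path.
# Assumes GNU stat (coreutils); the -c format option is GNU-specific.
show_owner_mode() {
    stat -c '%U %a' "$1"
}

# Example: show_owner_mode /usr/local/hadoop
# After steps 6 and 7 it should report hduser as the owner and 755 as the mode.
```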
Step 8 – Switch user. The su command lets you execute commands with the privileges of another user account.
$ su hduser
Step 9 – Change to the directory that contains the downloaded hadoop-2.6.4.tar.gz file. In my case it is in the /home/hduser/Desktop folder. For you it might be in your Downloads folder, so check first.
$ cd /home/hduser/Desktop/
Step 10 – Untar the hadoop-2.6.4.tar.gz file.
$ tar xzf hadoop-2.6.4.tar.gz
Step 11 – Move the contents of hadoop-2.6.4 folder to /usr/local/hadoop
$ mv hadoop-2.6.4/* /usr/local/hadoop
Step 12 – Edit the $HOME/.bashrc file to add the Java and Hadoop paths.
$ gedit $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
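If you prefer the terminal to an editor, the same lines can be appended from the shell. The sketch below is one possible way to do it: append_once is a hypothetical helper (not part of Hadoop or Ubuntu) that skips a line if the file already contains it, so repeating the step does not duplicate entries.

```shell
# Hypothetical helper: append LINE to FILE unless an identical line is
# already present, so running it twice does not duplicate entries.
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (single quotes keep $PATH and $HADOOP_HOME unexpanded in the file):
# append_once 'export HADOOP_HOME=/usr/local/hadoop' "$HOME/.bashrc"
# append_once 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64' "$HOME/.bashrc"
# append_once 'export PATH=$PATH:$HADOOP_HOME/bin' "$HOME/.bashrc"
```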
Step 13 – Reload your changed $HOME/.bashrc settings
$ source $HOME/.bashrc
Step 14 – Verify the Hadoop installation. This command displays the Hadoop version in the terminal.
$ hadoop version
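If the version command is not found, the usual cause is that one of the variables from step 12 is missing or points to the wrong place. A minimal sketch of a check follows; check_hadoop_env is a hypothetical helper name, and the messages are illustrative only.

```shell
# Hypothetical helper: report whether the variables from step 12 are usable.
# Returns 0 when everything looks right, 1 otherwise.
check_hadoop_env() {
    rc=0
    if [ -z "$HADOOP_HOME" ] || [ ! -d "$HADOOP_HOME" ]; then
        echo "HADOOP_HOME is unset or not a directory: '$HADOOP_HOME'"
        rc=1
    fi
    if [ -z "$JAVA_HOME" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME does not point at a Java installation: '$JAVA_HOME'"
        rc=1
    fi
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop is not on PATH"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "environment looks OK"
    fi
    return "$rc"
}

# Example: check_hadoop_env
```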
Execution of WordCount Example
The following example copies the .txt files from the /usr/local/hadoop/ directory to use as input and then counts how many times each word occurs in them. Output is written to the given output directory.
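As a rough illustration of what a word count computes, the same kind of result can be produced locally with standard Unix tools. This is a sketch, not the Hadoop implementation: local_wordcount is a hypothetical helper, and it assumes words are simply whitespace-separated tokens, so edge cases may be tokenized differently from the Hadoop example.

```shell
# Hypothetical helper: count whitespace-separated tokens in the given files
# and print "word<TAB>count", one word per line, sorted by word.
# The body runs in a subshell so the locale change does not leak out.
local_wordcount() (
    export LC_ALL=C                 # byte-order sorting and comparison
    cat "$@" |
        tr -s '[:space:]' '\n' |    # put one token on each line
        grep -v '^$' |              # drop the empty line leading whitespace can leave
        sort |                      # make equal tokens adjacent
        uniq -c |                   # prefix each distinct token with its count
        awk '{ print $2 "\t" $1 }'  # reorder to word<TAB>count
)

# Example: local_wordcount /home/hduser/Desktop/input/*.txt
```

Comparing a few lines of this output with the job's output is a quick sanity check that the job ran over the files you expected.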
Step 1 – Creating input directory.
$ mkdir /home/hduser/Desktop/input
Step 2 – Copy all text files from $HADOOP_HOME to /home/hduser/Desktop/input.
$ cp $HADOOP_HOME/*.txt /home/hduser/Desktop/input
Step 3 – Verify copy.
$ ls -l /home/hduser/Desktop/input
Step 4 – Submit the jar file to run the job. The sample WordCount example jar is in the $HADOOP_HOME/share/hadoop/mapreduce/ folder.
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /home/hduser/Desktop/input /home/hduser/Desktop/output
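Note that Hadoop refuses to start a job whose output directory already exists, so running the command a second time fails until the old directory is removed. A small sketch for that follows; clear_output_dir is a hypothetical helper, and it deletes the directory permanently, so check the path before using it.

```shell
# Hypothetical helper: remove a previous local output directory, if any,
# so the job can be submitted again. Does nothing if the path is absent.
clear_output_dir() {
    if [ -d "$1" ]; then
        rm -r -- "$1"
    fi
}

# Example: clear_output_dir <the output directory you passed to the job>
```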
Step 5 – Verify output.
$ cat /home/hduser/Desktop/output/*
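Each line of the output holds a word and its count separated by a tab, and the list is ordered by word rather than by frequency. To see the most frequent words first, the output can be re-sorted on the count column. The sketch below assumes that tab-separated layout; top_words is a hypothetical helper name.

```shell
# Hypothetical helper: print the N most frequent words from WordCount output.
# Usage: top_words N FILE...
top_words() {
    n=$1
    shift
    # Field 2 is the count: sort numerically, largest first, then truncate.
    cat "$@" | sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}

# Example: top_words 10 /home/hduser/Desktop/output/*
```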