Hadoop was one of the first popular open source big data technologies. It is a scalable, fault-tolerant system
for processing large datasets across a cluster of commodity servers, and it provides a simple programming
framework for large-scale data processing using the resources available across that cluster.
Hadoop was inspired by systems invented at Google to create the inverted index for its search product.
Google engineers described these systems in two papers. The first, “MapReduce: Simplified Data Processing
on Large Clusters” (2004) by Jeffrey Dean and Sanjay Ghemawat, is available at
research.google.com/archive/mapreduce.html. The second, “The Google File System” (2003), is available at
research.google.com/archive/gfs.html. Inspired by these papers, Doug Cutting and Mike Cafarella
developed an open source implementation, which later became Hadoop.
Many organizations have replaced expensive proprietary commercial products with Hadoop for
processing large datasets. The first reason is cost: Hadoop is open source and runs on a cluster of commodity
hardware, so you can scale it simply by adding inexpensive servers. Because Hadoop provides high availability
and fault tolerance in software, you don’t need to buy expensive fault-tolerant hardware. The second reason
is that Hadoop is better suited to certain types of data processing tasks, such as batch processing and ETL
(extract, transform, load) of large-scale data.
Hadoop is built on a few important ideas. First, it is cheaper to store and process large amounts of data
on a cluster of commodity servers than on a few high-end, powerful servers. In other words, Hadoop uses a
scale-out architecture instead of a scale-up architecture.
Second, implementing fault tolerance through software is cheaper than implementing it in hardware.
Fault-tolerant servers are expensive. Hadoop does not rely on fault-tolerant servers. It assumes that servers
will fail and transparently handles server failures. An application developer need not worry about handling
hardware failures. Those messy details can be left for Hadoop to handle.
Third, moving code from one computer to another over a network is a lot more efficient and faster than
moving a large dataset across the same network. For example, assume you have a cluster of 100 computers
with a terabyte of data on each computer. One option for processing this data would be to move it to a very
powerful server that can process 100 terabytes of data. However, moving 100 terabytes of data will take a
long time, even on a very fast network. In addition, you will need very expensive hardware to process data
with this approach. Another option is to move the code that processes this data to each computer in your
100-node cluster; this is a lot faster and more efficient than the first option, and it does not require
expensive high-end servers.
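A quick back-of-the-envelope calculation makes the point. The sketch below assumes a sustained 10 Gbps network link, an illustrative figure not taken from the text; real-world throughput would be lower.

```python
# Rough estimate of how long it takes to move 100 TB across a
# network, assuming a sustained 10 Gbps link (an illustrative
# assumption; actual throughput varies).
data_bytes = 100 * 10**12        # 100 terabytes of data
link_bits_per_sec = 10 * 10**9   # assumed 10 Gbps link

seconds = data_bytes * 8 / link_bits_per_sec
print(f"{seconds / 3600:.1f} hours")  # roughly 22.2 hours
```

Even under this optimistic assumption, the transfer alone takes the better part of a day, whereas shipping a few megabytes of code to each node takes seconds.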
Fourth, writing a distributed application can be made easy by separating core data processing logic
from distributed computing logic. Developing an application that takes advantage of resources available on
a cluster of computers is a lot harder than developing an application that runs on a single computer. The
pool of developers who can write applications that run on a single machine is several orders of magnitude
larger than the pool of those who can write distributed applications. Hadoop provides a framework that hides
the complexities of writing distributed applications, and it thus allows organizations to tap into a much
bigger pool of application developers.

Although people talk about Hadoop as a single product, it is not really a single product. It consists of
three key components: a cluster manager, a distributed compute engine, and a distributed file system.
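The separation of concerns described above can be illustrated with a MapReduce-style word count. The following is a minimal single-machine sketch, not Hadoop's actual Java API: the developer writes only the map and reduce functions, while the surrounding driver stands in for everything the framework would handle on a real cluster (distribution, shuffling, fault tolerance).

```python
from collections import defaultdict

# The developer writes only these two functions; in real Hadoop,
# the framework handles distribution, shuffling, and failures.
def map_fn(line):
    # Emit a (word, 1) pair for each word in a line of input.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for a given word.
    return word, sum(counts)

def run_local_mapreduce(lines):
    # A minimal single-machine stand-in for the framework's
    # map, shuffle, and reduce phases.
    shuffled = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            shuffled[word].append(count)
    return dict(reduce_fn(w, c) for w, c in shuffled.items())

print(run_local_mapreduce(["big data", "big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```

The same pair of functions would run unchanged whether the input is two strings on a laptop or terabytes spread across hundreds of nodes; only the driver changes.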