Hadoop: First Steps

I think you’ll be interested in this link.


“The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

I know, that’s a little bit dense. Let’s see if we can’t break it down a little bit.

The first thing to know is that Apache Hadoop is a software framework designed to aid in the use of distributed computing. Unsurprisingly, it is published by the Apache Software Foundation. It was created by a very Google-able man named Doug Cutting, together with Mike Cafarella.

Hadoop allows a multitude of networked computers to work together to get through large amounts of data processing in a short period of time.

Let me give you an example.

Years ago I worked for the Department of Physics and Astronomy at the University of Kentucky. One of the goals of the department was to aid the SETI project. SETI stands for the Search for Extraterrestrial Intelligence. Those who have seen the movie Contact (or the excellent novel by Carl Sagan that it is based on) may be familiar with the notion that the search for intelligent life beyond Earth involves sifting through enormous amounts of noise in search of patterns. The SETI program, never a favorite in Washington, could not afford the immense computing power required to process the signals it received from places like the VLA.

Because of this, distributed computing programs were developed that would use the combined processing power of many computers all over the world. Any time a computer that was part of the distribution network entered a period of user inactivity, it would use that time to download a data packet from the SETI headquarters and process the packet for patterns that would indicate some manner of regularity. It would then send its findings back to a host computer.
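
For the curious, here is a rough sketch in Python of the fetch, process, and report cycle described above. To be clear, this is a toy of my own and not the actual SETI client: the function names, the packet format, and the "look for a value repeated three times in a row" detector are all made up for illustration, and the "download" is just pulling the next packet off an in-memory queue.

```python
from collections import deque


def make_work_units(samples, unit_size):
    """Split a long recording into fixed-size packets (hypothetical format)."""
    return deque(
        (start, samples[start:start + unit_size])
        for start in range(0, len(samples), unit_size)
    )


def find_repeats(packet, min_run=3):
    """Toy pattern detector: offsets where a value repeats min_run times in a row."""
    hits = []
    run_start = 0
    for i in range(1, len(packet) + 1):
        # A run ends at the end of the packet or when the value changes.
        if i == len(packet) or packet[i] != packet[run_start]:
            if i - run_start >= min_run:
                hits.append(run_start)
            run_start = i
    return hits


def volunteer_loop(queue, is_idle):
    """While the machine is idle, take a packet, process it, and record findings."""
    findings = {}
    while queue and is_idle():
        offset, packet = queue.popleft()   # stands in for downloading a packet
        hits = find_repeats(packet)        # process the packet locally
        if hits:
            findings[offset] = hits        # stands in for reporting to the host
    return findings


samples = [4, 1, 7, 7, 7, 2, 9, 3, 5, 5, 5, 5]
queue = make_work_units(samples, unit_size=6)
print(volunteer_loop(queue, is_idle=lambda: True))  # prints {0: [2], 6: [2]}
```

The part worth noticing is that each packet can be processed without knowing anything about the others, which is what makes it possible to farm the work out to many machines.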

Now, the program used back then was called SETI@home, and it was rather primitive compared to what is available now. Hadoop does not require the computers within the network to be idle, and it doesn’t necessarily rely on single packets of information being downloaded to each computer. It’s much more flexible than that. Read more about how exactly Hadoop works at the official Apache site:


While you’re on the site be sure to have a look at the companies featured in the “Who uses Hadoop?” section.  Some very big names in there.  Amazon, Bing, Yahoo, Facebook.  The list goes on.

Next time we’ll have a look at Google’s MapReduce in relation to Hadoop, and at what sort of projects you might be able to undertake with them.
