Saturday, December 1, 2012

Getting Started with Hadoop on Ubuntu

So, I'm playing with Hadoop again (haven't done it in a while), and since Ubuntu is my current weapon of choice, I went looking for tutorials. I found a great one at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/, but I wanted something even simpler: a script, plus a sample program and instructions on how to compile and run it. So I created one (the differences with Noll's tutorial are at the very bottom). It is available at https://github.com/okaram/scripts/blob/master/hadoop/install.sh .

You just need to download it and make it executable:
wget https://raw.github.com/okaram/scripts/master/hadoop/install-hadoop-1.1.sh
chmod a+x install-hadoop-1.1.sh
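If you just want the gist, the script does roughly the following (this is a paraphrase from memory, not the actual contents; the exact package names, mirror URL, and config steps are in the script itself, so read the real thing):
# rough outline only -- not the actual script
apt-get install -y openjdk-6-jdk                  # a JDK for Hadoop to run on
addgroup hadoop                                   # dedicated group...
adduser --ingroup hadoop --disabled-password --gecos "" hduser  # ...and user
wget http://SOME-MIRROR/hadoop-1.1.0/hadoop-1.1.0.tar.gz   # fetch Hadoop 1.1
tar xzf hadoop-1.1.0.tar.gz -C /usr/local         # unpack it...
mv /usr/local/hadoop-1.1.0 /usr/local/hadoop      # ...as /usr/local/hadoop
chown -R hduser:hadoop /usr/local/hadoop          # hand it to hduser
# plus: ssh keys for hduser, conf/*.xml edits, and formatting the namenode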
You probably want to look at it in your favorite editor (it is not a good idea to just run a script from the internet; I trust myself, but you shouldn't trust me), and you may want to change the mirror while you're at it (I live in Atlanta, so I use Georgia Tech's). Once you're happy with it, run it as root:
sudo su -c ./install-hadoop-1.1.sh
And you should be done with the installation! The script creates a user for Hadoop, called hduser; you can change to it by typing:
sudo su - hduser
Then, as that user, you want to set up your path and classpath (the classpath is needed for compiling):
PATH=$PATH:/usr/local/hadoop/bin
export CLASSPATH=/usr/local/hadoop/hadoop-core-1.1.0.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar
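Note that these settings only live in the current shell; if you want them back every time you log in as hduser, appending them to hduser's ~/.bashrc works (my own habit, not something the install script sets up for you):
echo 'export PATH=$PATH:/usr/local/hadoop/bin' >> ~/.bashrc
echo 'export CLASSPATH=/usr/local/hadoop/hadoop-core-1.1.0.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar' >> ~/.bashrc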
And start Hadoop:
start-all.sh
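You can sanity-check that everything came up with jps, which ships with the JDK; on a single-node setup you should see the five Hadoop daemons:
jps
# expect something like (pids will differ):
# NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker, and Jps itself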
Now download my sample program (it is the standard WordCount example from the tutorial, but without the package statement, so you can compile it directly from that folder), compile it, and create a jar file:
wget https://raw.github.com/okaram/scripts/master/hadoop/samples/WordCount.java
javac WordCount.java
jar -cvf wc.jar *.class
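Compiling WordCount actually produces several .class files (one per nested class), so if you want to double-check that they all made it into the jar, list its contents:
jar -tf wc.jar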
Now we need to put some data into Hadoop; first we create a folder and copy a file into it (our same WordCount.java, since we just need a text file):
mkdir texts
cp WordCount.java texts/
And we copy that folder into HDFS (and list it, to verify it's there):
hadoop dfs -copyFromLocal texts /user/hduser/texts
hadoop dfs -ls /user/hduser/texts
And now we can run our program in Hadoop:
hadoop jar wc.jar WordCount /user/hduser/texts /user/hduser/out
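The job writes its results under /user/hduser/out; the quickest way to peek at them is to cat the part files (their exact names vary by version, so I use a glob):
hadoop dfs -ls /user/hduser/out
hadoop dfs -cat /user/hduser/out/part*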
When you want to stop Hadoop, just run the stop-all.sh command; also, if you want to copy the output to your local file system, just use the -copyToLocal option of hadoop's dfs, as shown below.
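For example (out here is just a local folder name I picked; use whatever you like):
hadoop dfs -copyToLocal /user/hduser/out out
stop-all.sh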
The install script is completely automated, so you can even use it to bootstrap an Amazon EC2 instance; for example, use:
ec2-run-instances ami-51de5e38 -t t1.micro -k mac-vm -f install-hadoop-1.1.sh
to start a micro instance with an Ubuntu 12.04 daily build (for Dec-1-2012; change the AMI id to get a different one :) and a key named mac-vm; the -f flag passes the script as user data, which Ubuntu's cloud-init runs at first boot, so the instance installs Hadoop all by itself.
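Once the instance is running, ec2-describe-instances will show you its public DNS name, so you can ssh in and switch to hduser as before (I'm assuming here that you saved the key pair as mac-vm.pem; Ubuntu AMIs log you in as the ubuntu user):
ec2-describe-instances
ssh -i mac-vm.pem ubuntu@YOUR-INSTANCE-PUBLIC-DNS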
