Here are the steps for setting up a Hadoop cluster and plugging in a Windows machine as a slave-
a. First, set up a pseudo-distributed Hadoop on a Linux machine as explained in http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
I was able to use this excellent tutorial with minor changes to get a pseudo-distributed Hadoop setup running on CentOS/Ubuntu and on a Windows machine.
I used a common user, hadoop, created on all machines.
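Once the single-node setup is running, a quick sanity check is to list the Hadoop daemons with jps and poke HDFS (the prompt below is just a placeholder for your Linux box):
hadoop@linuxbox> jps
# should list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker
hadoop@linuxbox> bin/hadoop fs -ls /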
b. The next step was to get all the pseudo-distributed machines working together as a real cluster. Again, http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) was an easy reckoner to get it working.
Some easy tips for getting Hadoop working in cluster mode:
- Use machine names everywhere instead of IP addresses, and update /etc/hosts on all machines
- Configure the setup on the master machine, i.e. the conf xml files along with the masters and slaves files as well as /etc/hosts, and then copy all these conf files and /etc/hosts entries to the slave nodes (see the conf sketch after this list)
- The same copying approach works for the authorized_keys file: collect the public keys from each slave into the master machine's authorized_keys and then copy that authorized_keys file back to all slaves (see the copying sketch below)
- Set JAVA_HOME in each installation's hadoop-config.sh file. I had some issues when setting it in .profile and still kept hitting JAVA_HOME problems
- Another easy option is to create a gzip of your master Hadoop install and copy it over to set up the slave nodes
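To make the conf copying concrete, this is roughly what the multi-node settings look like on Hadoop 0.20; "master", the ports and the file layout are the values from the Michael Noll tutorial, so substitute your own machine names:
conf/core-site.xml (on every node):
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
conf/mapred-site.xml (on every node):
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
conf/masters (on the master only) lists just the master's hostname; conf/slaves (on the master only) lists every node that should run a datanode/tasktracker, one hostname per line.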
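And a rough sketch of the copying itself, assuming the common hadoop user everywhere and a hypothetical slave called slave1 (the IP addresses are placeholders):
# /etc/hosts - identical entries on every machine, names instead of IPs
192.168.0.1    master
192.168.0.2    slave1
192.168.0.3    slave2
# collect each slave's public key on the master, then push the merged file back
hadoop@slave1> ssh-keygen -t rsa -P ""
hadoop@slave1> scp ~/.ssh/id_rsa.pub master:/tmp/slave1.pub
hadoop@master> cat /tmp/slave1.pub >> ~/.ssh/authorized_keys
hadoop@master> scp ~/.ssh/authorized_keys slave1:~/.ssh/authorized_keys
# or simply tar up the working master install and unpack it on each slave
hadoop@master> tar czf hadoop-0.20.0.tar.gz hadoop-0.20.0/
hadoop@master> scp hadoop-0.20.0.tar.gz slave1:~/
hadoop@master> ssh slave1 "tar xzf hadoop-0.20.0.tar.gz"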
c. Now for the Windows bit-
- Install Cygwin if you have not already done so
- Check whether the sshd server is installed in your Cygwin setup; if not, install it (see the sshd sketch after this list)
- Double-check that a CYGWIN sshd service is running under Windows services
- Create a hadoop user with:
cygwin> net user hadoop password /add /yes
cygwin> mkpasswd -l -u hadoop >> /etc/passwd
cygwin> chown hadoop -R /home/hadoop
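If sshd is missing, the usual route is the ssh-host-config script that ships with Cygwin's openssh package; this is only a sketch, and the prompts may differ slightly between Cygwin versions:
cygwin> ssh-host-config -y          # configure sshd as a Windows service, answering yes to the prompts
cygwin> cygrunsrv --start sshd      # or: net start sshd
cygwin> cygrunsrv --query sshd      # should report the CYGWIN sshd service as Running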
d. Treat the Windows machine as *nix-
- Now use PuTTY to log in to your local Windows machine as the newly created hadoop user
- Set up Hadoop as you would for any Linux machine- the easy option is to copy over the master Hadoop installation
- Do not forget to set up the .ssh files: add this machine's public key to the master's authorized_keys file and copy that authorized_keys file back to this Windows machine. Also add JAVA_HOME to the hadoop-config.sh file, which on Windows should be a /cygdrive/<path to java6> entry (see the sketch below)
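A rough example of those Windows-side tweaks (the Java path and the windows-box hostname are placeholders for your own setup):
# in hadoop-config.sh on the Windows node
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0
# exchange keys with the master, just like the Linux slaves
hadoop@windows-box> ssh-keygen -t rsa -P ""
hadoop@windows-box> scp ~/.ssh/id_rsa.pub master:/tmp/windows-box.pub
hadoop@master> cat /tmp/windows-box.pub >> ~/.ssh/authorized_keys
hadoop@master> scp ~/.ssh/authorized_keys windows-box:~/.ssh/authorized_keys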
e. Assuming the master is already running, start this slave with "bin/hadoop-daemon.sh start datanode" and "bin/hadoop-daemon.sh start tasktracker" to bring up the datanode and tasktracker instances (a quick check follows below).
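A quick way to confirm the slave has actually joined (again just a sketch, with placeholder hostnames):
# on the slave: the daemons should show up in jps
hadoop@windows-box> jps
# on the master: the new node should appear in the live datanode list
hadoop@master> bin/hadoop dfsadmin -report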
Next, I will write about how I managed to get the Hive-0.30 release working with Hadoop 0.20.0 on my small Hadoop cluster of 3 Linux machines and 1 Windows machine.