Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences


Assignment 04 Details

  1. Using family-guy rather than openlab appears to be the better choice.
    1. Both are load-balanced clusters, so logging into family-guy.ics.uci.edu will put you on a low-load machine.
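The login step above is just a plain SSH session; a minimal sketch, assuming your UCInetID is the username (the placeholder below is not a real account):

```shell
# Log into the load-balanced ICS cluster; replace YOUR_UCINETID with your own ID.
ssh YOUR_UCINETID@family-guy.ics.uci.edu
```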
  2. Setting up hadoop:
    1. You will need to download and install Hadoop 0.20.1. (You should have done this for the last assignment; if you already have 0.20.1, you can reuse that install.)
    2. Go to the conf/ folder inside the Hadoop folder. You will be replacing 2 files (core-site.xml and mapred-site.xml; if you want to make a backup, do so now).
    3. Make sure Java is installed as described here (if you are using ICS machines).
    4. You will download 2 files:
      http://www.mitchdempsey.com/cs221/core-site.xml
      http://www.mitchdempsey.com/cs221/mapred-site.xml
  3. If you are on the ICS cluster, you can do the following commands from inside the conf/ folder:
    1. module load wget
    2. wget -O core-site.xml http://www.mitchdempsey.com/cs221/core-site.xml
    3. wget -O mapred-site.xml http://www.mitchdempsey.com/cs221/mapred-site.xml
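For context, these two files point your local Hadoop 0.20.1 client at the shared cluster instead of your standalone setup. The real settings are whatever is in the downloaded files; a core-site.xml of this kind plausibly looks something like the sketch below (the NameNode port and the SOCKS properties here are assumptions, not the actual values):

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch only; use the downloaded core-site.xml, not this. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- hypothetical NameNode address on the EC2 cluster -->
    <value>hdfs://ec2-184-73-1-20.compute-1.amazonaws.com:9000</value>
  </property>
  <property>
    <!-- route Hadoop RPC through the local SOCKS proxy opened by ssh -D 2600 -->
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:2600</value>
  </property>
</configuration>
```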
  4. Now go back up to the hadoop root folder.
  5. You will need to download the keyfile required to connect:
    1. wget http://www.mitchdempsey.com/cs221/authkey
  6. Once you have downloaded that, you can now connect to the compute cluster
  7. Execute this command from the hadoop root folder. (It may pause for a second, then return to the prompt without any errors.)
    1. ssh -D 2600 -f -N -i authkey cs221@ec2-184-73-1-20.compute-1.amazonaws.com
    2. Answer yes if it shows a host-key fingerprint and asks whether you want to continue (this only happens on the first connection).
    3. (If you get errors about the port being in use, try another node in the ICS cluster.)
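For reference: -D 2600 opens a local SOCKS proxy on port 2600 (which the downloaded config files presumably route Hadoop traffic through), -f backgrounds the process, -N runs no remote command, and -i authkey selects the key file. Since the backgrounded tunnel has no terminal, a small check like this sketch can tell you whether it is still running:

```shell
# Check whether the background SSH tunnel from the previous step is still up.
# The [0] in the pattern stops pgrep from matching this check's own command line.
if pgrep -f "ssh -D 260[0]" > /dev/null; then
  echo "tunnel is up"
else
  echo "tunnel is down; re-run the ssh command"
fi
```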
  8. Now, you should be able to connect to the cluster. You can try by doing the following:
    1. bin/hadoop dfsadmin -report
    2. (If this command returns a list of nodes with some information about the file system, then you are good to go.)
  9. You should be able to perform exactly the same operations that you did on your standalone cluster. Please be mindful that the filesystem is shared (do not delete files that are not yours).
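The commands are the usual HDFS shell operations; a hedged sketch, where the /user/YOUR_NAME path is a placeholder for a directory of your own on the shared filesystem:

```shell
# Illustrative HDFS commands on the shared cluster (run from the hadoop root folder).
# Keep your files under your own directory so you don't disturb anyone else's data.
bin/hadoop fs -mkdir /user/YOUR_NAME
bin/hadoop fs -put localfile.txt /user/YOUR_NAME/
bin/hadoop fs -ls /user/YOUR_NAME
```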
  10. LINKS:
    1. JobTracker - view status and statistics for your jobs.
      http://ec2-184-73-1-20.compute-1.amazonaws.com:50030/jobtracker.jsp
    2. DFS Admin - you won't be able to browse the filesystem, but you can see its status.
      http://ec2-184-73-1-20.compute-1.amazonaws.com:50070/dfshealth.jsp