Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences


Assignment 03

  1. Goals
    1. This assignment is designed to:
      1. Teach you how to build a postings list with a MapReduce architecture.
      2. Teach you how to use Hadoop before you try it on a Wikipedia-scale assignment.
  2. Administration:
    1. You may work in teams of 1, 2, or 3, but your team cannot be the same group you worked in on a previous assignment.
  3. Write a Java Program To Be Executed by Hadoop
    1. The input to the system will be a list of URL and DocID pairs in a file.
      1. Here is the input file.
    2. The output of the system will be a postings list, alphabetized by the terms found in the pages at the input URLs.
      1. A text file with the term at the start of each line, followed by a list of all document URLs on which that term appears.
  4. Write a Java Program To Post-Process the Results
    1. This program will calculate the statistics required for the document that you turn in.
  5. How to do this:
    1. Download Hadoop
    2. Install it locally on an openlab machine
    3. Run it as a single-node cluster
    4. Here are some tutorials
      1. Follow the instructions here ("Picking your java version on openlab") to get the right version of Java.
      2. "Hadoop Map/Reduce Tutorial"
      3. "Running Hadoop On Ubuntu Linux (Single-Node Cluster)"
    5. Hints:
      1. The lines of the input.txt file are passed to your mapper function one at a time.
      2. For each line, your mapper should download the page at that URL, convert all the text to lower case, and generate <term, docid> pairs.
        1. (See Downloader.java for a sample of how to use the crawler4j library to download a single page.)
        2. Here is an example for WordCount.
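The mapper hints above can be sketched in plain Java, without any Hadoop dependency, to show the data flow from page text to postings lines. This is only an illustration under stated assumptions: the names `PostingsSketch`, `map`, `reduce`, and `formatLine` are hypothetical, the page download is replaced by a string passed in directly, a term is assumed to be a run of ASCII letters or digits, and the output line is assumed to be `term, docid1, docid2, ...` without angle brackets. In a real job the same logic would presumably live in Hadoop Mapper and Reducer classes, as in the WordCount example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a Hadoop job: same map and reduce logic, run in memory.
class PostingsSketch {

    // Map step: one document's text in, a list of <term, docid> pairs out.
    static List<Map.Entry<String, String>> map(String docId, String pageText) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        // Assumption: lower-case first, then split on anything that is not a-z or 0-9.
        for (String term : pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                pairs.add(new SimpleEntry<>(term, docId));
            }
        }
        return pairs;
    }

    // Shuffle and reduce step: group pairs by term and keep each docid once.
    // TreeMap keeps the terms in alphabetical order.
    static SortedMap<String, SortedSet<String>> reduce(List<Map.Entry<String, String>> pairs) {
        SortedMap<String, SortedSet<String>> postings = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            postings.computeIfAbsent(pair.getKey(), k -> new TreeSet<>()).add(pair.getValue());
        }
        return postings;
    }

    // Formats one output line as: term, docid1, docid2, ...
    static String formatLine(String term, SortedSet<String> docIds) {
        return term + ", " + String.join(", ", docIds);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.addAll(map("1", "Information retrieval is fun"));
        pairs.addAll(map("2", "Retrieval of information"));
        reduce(pairs).forEach((term, ids) -> System.out.println(formatLine(term, ids)));
    }
}
```

Note that the docids here are strings, so they sort lexicographically ("10" before "2"); a numeric sort would need integer docids instead.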
  6. Test
    1. With this input.
    2. Don got this output.
  7. What to turn in:
    1. Outputs:
      1. Submit a postings.txt file in which each line contains the postings of one term: <term, docid1, docid2, ...>. Before uploading this file, truncate it so that it contains only the first 500 lines.
      2. Submit a report.txt file with this content:
        1. Total number of distinct terms: ....
        2. Number of documents containing term 'is': ...
        3. Number of documents containing term 'satellite': ...
        4. Number of documents containing the first name of member 1 of your team: ...
        5. List of URLs containing first name of member 1 of your team: ...
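A post-processing program for the report could look roughly like the sketch below. It is a minimal illustration, not a required design: the class `PostingsReport` and its methods are hypothetical names, and each postings line is assumed to have the form `term, docid1, docid2, ...` with comma separators and no angle brackets. Listing URLs for a term would additionally need the docid-to-URL mapping from the input file, whose exact format is not shown here, so that step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for computing the report statistics from postings lines.
class PostingsReport {

    // Parses lines of the assumed form "term, docid1, docid2, ..." into term -> docids.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> postings = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s*,\\s*");
            List<String> docIds = new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
            postings.put(fields[0], docIds);
        }
        return postings;
    }

    // Number of documents containing the term; terms were lower-cased when indexed.
    static int documentCount(Map<String, List<String>> postings, String term) {
        List<String> docIds = postings.get(term.toLowerCase(Locale.ROOT));
        return docIds == null ? 0 : docIds.size();
    }

    // Keeps at most the first `limit` lines, e.g. 500 for the uploaded postings.txt.
    static List<String> firstLines(List<String> lines, int limit) {
        return new ArrayList<>(lines.subList(0, Math.min(limit, lines.size())));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("is, 1, 2, 3", "satellite, 2");
        Map<String, List<String>> postings = parse(lines);
        System.out.println("Total number of distinct terms: " + postings.size());
        System.out.println("Number of documents containing term 'is': " + documentCount(postings, "is"));
        System.out.println("Number of documents containing term 'satellite': "
                + documentCount(postings, "satellite"));
    }
}
```

In this sketch the statistics are computed from the full list of lines, and `firstLines` is applied only when producing the file to upload; computing the counts from the truncated file would presumably undercount the distinct terms.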
  8. Submitting your assignment
    1. Turn in your report to the dropbox created for this assignment on EEE.
    2. Make the file name <StudentID>-<StudentIID>-Assignment03.pdf
    3. Please include the full names of all of your group members in the document.