Informatics 141: Information Retrieval

Assignment 05

Winter 2009

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine


Due 2/13/2009

  1. Goals:
    1. This assignment is designed to teach you how to:
      1. Build a postings list with a MapReduce architecture.
      2. Use Hadoop before trying it on a Wikipedia-scale assignment.
  2. Administration:
    1. You may work in teams of 1, 2, or 3, but your team cannot be the same group as on a previous assignment.
  3. Write a Java Program To Be Executed by Hadoop
    1. The input to the system will be a list of <key,value> pairs in a file.
      1. The "key" will be a URL
      2. The "value" will be the document ID of that URL
      3. Here is the input file.
    2. The output of the system will be a postings list ordered by the terms found in the input URLs
      1. The output is a text file with one term at the start of each line, followed by a list of all document URLs on which that term appears.
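The grouping step that turns <term, docid> pairs into sorted postings lines can be tried locally before it is moved into a Hadoop reducer. The following is a minimal sketch in plain Java with no Hadoop dependency. The class name `PostingsBuilderSketch`, its methods, and the comma-separated line layout are assumptions made for illustration, not part of the assignment's required code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: imitates the shuffle and reduce steps in memory.
public class PostingsBuilderSketch {

    // term -> sorted set of docids; TreeMap keeps the terms in sorted order.
    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Collects one <term, docid> pair, as a mapper might emit it. */
    public void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    /** Produces one line per term: "term, docid1, docid2, ..." (assumed layout). */
    public List<String> toLines() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
            StringBuilder line = new StringBuilder(entry.getKey());
            for (int docId : entry.getValue()) {
                line.append(", ").append(docId);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        PostingsBuilderSketch builder = new PostingsBuilderSketch();
        builder.add("retrieval", 2);
        builder.add("information", 1);
        builder.add("retrieval", 1);
        builder.add("retrieval", 2); // a repeated pair is stored only once
        builder.toLines().forEach(System.out::println);
    }
}
```

In a real Hadoop job the framework performs the grouping by key, so the reducer would only need the inner loop that joins the docids for one term.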
  4. Write a Java Program To Post-Process the Results (optional for this assignment)
    1. This program will calculate the statistics required for the document that you turn in.
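One possible shape for such a post-processing program is sketched below. It assumes each postings line looks like "term, docid1, docid2, ..." with each docid listed once. The class and method names are hypothetical, and the lines are passed in as a list so the logic can be checked without a file. In practice the list could come from `Files.readAllLines` on the full, untruncated output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processor: derives report statistics from postings lines.
public class PostingsStatsSketch {

    /** Counts non-blank lines; with one line per term this is the number of distinct terms. */
    static long distinctTerms(List<String> lines) {
        return lines.stream().filter(line -> !line.trim().isEmpty()).count();
    }

    /** Returns how many docids are listed for the given term, or 0 if the term is absent. */
    static int documentCount(List<String> lines, String term) {
        for (String line : lines) {
            String[] fields = line.split(",\\s*"); // assumed separator: comma plus optional spaces
            if (fields[0].trim().equals(term)) {
                return fields.length - 1; // every field after the term is one docid
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("information, 1", "is, 1, 2, 3", "retrieval, 1, 2");
        System.out.println("Total number of distinct terms: " + distinctTerms(lines));
        System.out.println("Number of documents containing term 'is': " + documentCount(lines, "is"));
    }
}
```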
  5. How to do this:
    1. Refer to the notes from the discussion section on 2/9/09 for additional info.
    2. Look at this class wiki for help (feel free to edit)
    3. Here are some tutorials
      1. Follow the instructions here ("Picking your java version on openlab") to get the right version of Java.
      2. "Hadoop Map/Reduce Tutorial"
      3. "Running Hadoop On Ubuntu Linux (Single-Node Cluster)"
    4. Hints:
      1. The lines of the input.txt file are passed to your mapper function one by one.
      2. Your mapper should download the page at that URL, convert all of the text to lowercase, and generate <term, docid> pairs. (See Downloader.java for a sample of how to use the crawler4j library to download a single page.) Here is an example for WordCount.
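The per-line work described in these hints can be sketched without Hadoop or crawler4j, which makes the tokenizing logic easy to test on its own. The sketch below makes several assumptions: each input line holds a URL and a docid separated by whitespace, terms are runs of ASCII letters and digits, and the page text is passed in as a string in place of the actual download. The names `MapSketch`, `terms`, and `map` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the body of a mapper, with the download step left out.
public class MapSketch {

    /** Lowercases the text and returns its distinct terms in order of first appearance. */
    static Set<String> terms(String pageText) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) { // split can yield an empty leading token
                result.add(token);
            }
        }
        return result;
    }

    /** Emits one "term<TAB>docid" string per distinct term on the page. */
    static List<String> map(String inputLine, String pageText) {
        String[] parts = inputLine.trim().split("\\s+"); // assumed layout: URL, then docid
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected '<url> <docid>': " + inputLine);
        }
        String docId = parts[1];
        List<String> pairs = new ArrayList<>();
        for (String term : terms(pageText)) {
            pairs.add(term + "\t" + docId);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The page text here is made up; a real mapper would fetch it from parts[0].
        map("http://www.ics.uci.edu/ 7", "Information Retrieval, information!")
                .forEach(System.out::println);
    }
}
```

Emitting each term once per document keeps the later postings list free of repeated docids, though the reducer could also remove duplicates instead.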
  6. Test
    1. With this input.
    2. Don got this output.
  7. What to turn in:
    1. Outputs:
      1. Submit a postings.txt file in which each line contains the postings of one term: <term, docid1, docid2, ...>. Before uploading this file, truncate it to keep only the first 500 lines.
      2. Submit a report.txt file with this content:
        1. Total number of distinct terms: ....
        2. Number of documents containing term 'is': ...
        3. Number of documents containing term 'Satellite': ...
        4. First Name of member 1 of your team: ...
        5. List of URLs containing first name of member 1 of your team: ...
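The truncation of postings.txt to its first 500 lines can be done with any tool. As one option, here is a small Java sketch. The class name `TruncateSketch` and the command-line argument order are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical utility: copies at most the first N lines of one file to another.
public class TruncateSketch {

    /** Returns the first `limit` lines, or all lines if there are fewer. */
    static List<String> firstLines(List<String> lines, int limit) {
        return lines.stream().limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java TruncateSketch <full-postings-file> <output-file>");
            return;
        }
        List<String> all = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), firstLines(all, 500));
    }
}
```

The report statistics presumably describe the full postings list, so it seems safest to compute them before truncating.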
  8. Submitting your assignment
    1. We are going to use checkmate.ics.uci.edu to submit this assignment.
    2. Name the file <StudentID>-<StudentID>-<StudentID>-Assignment05.pdf