Informatics 141: Information Retrieval

Assignment 06

Winter 2009

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine

Due 3/2/2009

  1. Goals:
    1. This assignment is designed to:
      1. Teach you how to make a postings list with a distributed MapReduce architecture on a HUGE cluster.
      2. Capture the data that you need to calculate ranking scores for assignment 07.
  2. Administration:
    1. You may work in teams of 1, 2, or 3; there are no restrictions on membership.
  3. Write a Java Program To Be Executed by a 38-Node Hadoop Cluster
    1. The input to the system will be a list of <key,value> pairs in a file.
      1. The "key" will be a URL in Wikipedia
      2. The "value" will be the document ID of that URL
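A minimal sketch of the map-side logic in plain Java (no Hadoop classes), in case it helps to prototype locally before running on the cluster. It assumes each input record is a URL and a document ID separated by a tab, which this page does not specify, and it uses a deliberately simple regular expression to drop tags. The class name, method names, and sample URL are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MapSideSketch {
    /** Splits one input record into {url, docId}; ASSUMES a tab separator. */
    static String[] parseRecord(String line) {
        String[] parts = line.split("\t", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected <url>TAB<docid>: " + line);
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    /** Counts lowercase terms in the page content, skipping anything inside tags. */
    static Map<String, Integer> termFrequencies(String html) {
        // Crude tag removal; it does not handle scripts, styles, or comments.
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : text.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // Hypothetical record and page content, for illustration only.
        String[] rec = parseRecord("http://en.wikipedia.org/wiki/Irvine\t42");
        Map<String, Integer> tf = termFrequencies("<p>Irvine is a city. <b>Irvine</b>!</p>");
        // In a real mapper, each entry would be emitted as (term, <docid, tf>).
        tf.forEach((term, count) -> System.out.println(term + "\t" + rec[1] + "\t" + count));
    }
}
```

Counting term frequencies per document inside the map step means the reduce step only has to collect one <docid, tf> pair per document for each term.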
  4. Write a Java Program To Post-Process the Results (optional for this assignment)
    1. This program will calculate the statistics required for the document that you turn in.
  5. How to do this:
    1. Look at this class wiki for help (feel free to edit)
    2. Look at Assignment 5 resources if necessary
  6. Test
    1. With the input on the dfs at /assignment06/input10
    2. Don got this output.
  7. What to turn in:
    1. Submit a postings.txt file in which each line contains the postings of one term. All terms should be lowercase and should come from the content (not the tags) of the web page.
      1. <term, cf, df> <docid1, tf1> <docid2, tf2> ....
        1. where cf is the collection frequency of the term (the total number of occurrences of the term across all documents) and df is the document frequency of the term (the number of documents containing the term).
        2. tf1 is the term frequency of "term" in the document with docid1, and so on.
      2. Before uploading this file, truncate it to keep only the LAST 1000 lines.
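The line format and the truncation step can be sketched in plain Java as follows. This is an illustration under assumptions, not a required implementation: it assumes integer document IDs, sorts postings by docid (the format above does not require any order), and uses hypothetical class and method names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingsLineSketch {
    /** Formats one line as: <term, cf, df> <docid1, tf1> <docid2, tf2> ... */
    static String formatLine(String term, Map<Integer, Integer> tfByDoc) {
        long cf = 0; // collection frequency: sum of tf over all documents
        for (int tf : tfByDoc.values()) {
            cf += tf;
        }
        int df = tfByDoc.size(); // document frequency: number of documents
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(term).append(", ").append(cf).append(", ").append(df).append('>');
        // Sorting by docid is a choice made here, not a stated requirement.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(tfByDoc).entrySet()) {
            sb.append(" <").append(e.getKey()).append(", ").append(e.getValue()).append('>');
        }
        return sb.toString();
    }

    /** Returns the last n elements of the list (all of them if there are fewer than n). */
    static <T> List<T> lastN(List<T> lines, int n) {
        return new ArrayList<>(lines.subList(Math.max(0, lines.size() - n), lines.size()));
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tfByDoc = new HashMap<>();
        tfByDoc.put(12, 3);
        tfByDoc.put(5, 1);
        System.out.println(formatLine("irvine", tfByDoc)); // prints "<irvine, 4, 2> <5, 1> <12, 3>"
    }
}
```

In a Hadoop job, formatLine would correspond roughly to the work of the reduce step for one term, and lastN(lines, 1000) to the truncation done before uploading.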
    2. Submit a report.txt file with this content:
      1. Total number of distinct terms.
      2. List of the 10 words with the lowest document frequency.
      3. List of the 10 words with the highest document frequency.
      4. First name of member 1 of your team: ...
      5. List of (at most 1000) URLs containing the first name of member 1 of your team: ...
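One possible way to compute the first three report items from postings lines is sketched below. It assumes the lines follow the <term, cf, df> format exactly, that the statistics are taken over the full postings file rather than the truncated upload, and that ties in document frequency are broken alphabetically (a choice made here, not a requirement). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReportSketch {
    /** Term statistics read from the <term, cf, df> header of one postings line. */
    static final class TermStat {
        final String term;
        final long cf;
        final int df;

        TermStat(String term, long cf, int df) {
            this.term = term;
            this.cf = cf;
            this.df = df;
        }
    }

    private static final Pattern HEADER = Pattern.compile("^<(.+?), (\\d+), (\\d+)>");

    static TermStat parseHeader(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a postings line: " + line);
        }
        return new TermStat(m.group(1), Long.parseLong(m.group(2)), Integer.parseInt(m.group(3)));
    }

    /** Returns up to k terms ordered by document frequency, highest or lowest first. */
    static List<String> topByDf(List<TermStat> stats, int k, boolean highest) {
        Comparator<TermStat> byDf = Comparator.comparingInt((TermStat s) -> s.df);
        if (highest) {
            byDf = byDf.reversed();
        }
        List<TermStat> sorted = new ArrayList<>(stats);
        sorted.sort(byDf.thenComparing((TermStat s) -> s.term));
        List<String> terms = new ArrayList<>();
        for (TermStat s : sorted.subList(0, Math.min(k, sorted.size()))) {
            terms.add(s.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Made-up sample lines; a real run would read postings.txt instead.
        List<String> lines = Arrays.asList(
                "<a, 9, 3> <1, 4> <2, 3> <3, 2>",
                "<b, 1, 1> <2, 1>",
                "<c, 5, 2> <1, 2> <3, 3>");
        List<TermStat> stats = new ArrayList<>();
        for (String line : lines) {
            stats.add(parseHeader(line));
        }
        // Each line holds one term, so the line count is the distinct-term count.
        System.out.println("Distinct terms: " + stats.size());
        System.out.println("Lowest df: " + topByDf(stats, 10, false));
        System.out.println("Highest df: " + topByDf(stats, 10, true));
    }
}
```

Sorting every term is simple but holds all headers in memory; for a very large vocabulary, a bounded priority queue of size 10 would be a lighter alternative.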
  8. Submitting your assignment
    1. We are going to use checkmate.ics.uci.edu to submit this assignment.