Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences

Assignment 04

  1. Goals:
    1. This assignment is designed to:
      1. Teach you how to make a postings list with a distributed MapReduce architecture on a BIG cluster.
      2. Capture the data that you need to calculate ranking scores for assignment 05.
  2. Administration:
    1. You may work in teams of 1, 2, or 3; there are no restrictions on membership.
    2. Mitch will be managing the cluster on Amazon for us.
      1. This is going to require some instructions on how to upload and run jobs.
      2. Some details.
  3. Write a Java Program To Be Executed by a >50 node Hadoop cluster
    1. The input to the system will be a list of <key,value> pairs in a file.
      1. The "key" will be a URL in Wikipedia.
      2. The "value" will be the document ID of that URL.
    2. The posting list will be on the distributed file system.
    3. There is no guarantee that the URLs are valid.
  4. What to turn in:
    1. Submit a sorted postings.txt file in which each line contains postings of a term:
      1. A term is defined as follows:
        1. Don't consider text in the HTML tags.
        2. Replace all non-ASCII characters with spaces.
        3. Make all the characters lowercase.
        4. All terms are separated by white space.
        5. All punctuation should be removed from the beginning and end of a term, but not from the middle.
        6. No term should be longer than 32 characters. If a term is longer, truncate it to the first 32 characters.
        7. Examples:
          1. "However", "however", "however," should all map to "however"
          2. "Dogs", "dogs", and "dOgS" should all map to "dogs"
          3. "Dog's", "dog's", and "dog'S" should all map to "dog's"
          4. "jack-o-lantern." should map to "jack-o-lantern"
          5. "他⃞四⃞六⃞40ㅈ20ㅁ" should map to two terms "40" and "20"
          6. "1>alphabet" should map to "1>alphabet" (this is dumb, but for uniformity, we'll stick to the format)
          7. "010101010101010--010011101010101actgtgtacgatcgtagctggtagctcgtagcta--010100" should map to "010101010101010--010011101010101"
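The term rules above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the required implementation: the class name `TermNormalizer` and the method `terms` are hypothetical, an HTML tag is assumed to be anything between `<` and the next `>` and is replaced with a space, punctuation is taken to be the ASCII punctuation matched by `\p{Punct}`, and punctuation trimming is assumed to happen before truncation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; the name and structure are illustrative only.
final class TermNormalizer {
    static final int MAX_TERM_LENGTH = 32;

    /** Splits raw page text into terms following the rules listed above. */
    static List<String> terms(String html) {
        // Rule 1: ignore text inside HTML tags (assumption: a tag becomes a space).
        String text = html.replaceAll("<[^>]*>", " ");
        // Rule 2: replace every non-ASCII character with a space.
        text = text.replaceAll("[^\\x00-\\x7F]", " ");
        // Rule 3: lowercase everything.
        text = text.toLowerCase(Locale.ROOT);

        List<String> result = new ArrayList<>();
        // Rule 4: terms are separated by white space.
        for (String token : text.split("\\s+")) {
            // Rule 5: strip punctuation from both ends, but keep it in the middle.
            String term = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            // Rule 6: keep only the first 32 characters of long terms.
            if (term.length() > MAX_TERM_LENGTH) {
                term = term.substring(0, MAX_TERM_LENGTH);
            }
            // Tokens made only of punctuation become empty and are skipped.
            if (!term.isEmpty()) {
                result.add(term);
            }
        }
        return result;
    }
}
```

For example, `TermNormalizer.terms("<b>Dog's</b> jack-o-lantern.")` yields the two terms `dog's` and `jack-o-lantern`. Note that the simple tag pattern would also swallow text such as `a < b and c > d`, so a real HTML parser may be a better choice.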
      2. The form of the posting list should be:
        1. [term, cf, df] [docid1, tf1] [docid2, tf2] ....
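          1. where cf is the collection frequency of the term (its total number of occurrences across all documents) and df is the document frequency of the term (the number of documents that contain it).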
          2. tf1 is the term frequency of "term" in document with docid1 and so on.
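The posting format above could be produced by something like the following sketch, which mimics in plain Java what a reduce step might do for one term. The class name `PostingLine` and its methods are hypothetical, document IDs are assumed to be integers, and listing postings in ascending docid order is an assumption (the format above does not specify an order).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Hadoop or of the assignment itself.
final class PostingLine {
    /** Records one occurrence of the term in the given document. */
    static void addOccurrence(Map<Integer, Integer> tfByDoc, int docId) {
        tfByDoc.merge(docId, 1, Integer::sum);
    }

    /** Formats "[term, cf, df] [docid1, tf1] [docid2, tf2] ...". */
    static String format(String term, Map<Integer, Integer> tfByDoc) {
        // Assumption: postings are listed in ascending docid order.
        TreeMap<Integer, Integer> sorted = new TreeMap<>(tfByDoc);

        long cf = 0;                 // collection frequency: sum of all tf values
        for (int tf : sorted.values()) {
            cf += tf;
        }
        int df = sorted.size();      // document frequency: number of documents

        StringBuilder line = new StringBuilder();
        line.append('[').append(term).append(", ")
            .append(cf).append(", ").append(df).append(']');
        for (Map.Entry<Integer, Integer> posting : sorted.entrySet()) {
            line.append(" [").append(posting.getKey()).append(", ")
                .append(posting.getValue()).append(']');
        }
        return line.toString();
    }
}
```

If the term "dogs" occurs twice in document 7 and once in document 3, the formatted line is `[dogs, 3, 2] [3, 1] [7, 2]`.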
      3. Before uploading the posting list, truncate it and only keep the LAST 1000 lines.
    2. Submit a second report with this content:
      1. Total number of distinct terms.
      2. List of top 10 words with lowest document frequency.
      3. List of top 10 words with highest document frequency.
      4. First Name of member 1 of your team: ...
      5. List of (at most 1000) URLs containing first name of member 1 of your team: ...
        1. If there are no pages with your team member's name, use "Donald" instead and tell me you did that.
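The two top-10 lists in the report can be derived from a map of term to document frequency. The sketch below is one possible approach, not a prescribed one: the class name `DfReport` and the method `topByDf` are hypothetical, and breaking ties alphabetically is an assumption, since the report requirements above do not say how ties should be ordered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for the report; names are illustrative only.
final class DfReport {
    /**
     * Returns up to n terms ordered by document frequency:
     * highest first when 'highest' is true, lowest first otherwise.
     */
    static List<String> topByDf(Map<String, Integer> dfByTerm, int n, boolean highest) {
        Comparator<Map.Entry<String, Integer>> byDf = (a, b) -> {
            int c = Integer.compare(a.getValue(), b.getValue());
            if (highest) {
                c = -c;
            }
            // Assumption: ties are broken alphabetically by term.
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(dfByTerm.entrySet());
        entries.sort(byDf);

        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Calling `DfReport.topByDf(dfByTerm, 10, false)` gives the lowest-df list and `DfReport.topByDf(dfByTerm, 10, true)` the highest-df list. On the full collection, sorting every term in memory may not be practical, so this is better suited to a post-processing pass over the finished postings file.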
  5. Submitting your assignment
    1. Turn in your report to the dropbox created for this assignment on EEE.
    2. Please make two PDF documents.
      1. Make the file names:
        1. <StudentID>-<StudentIID>-Assignment04-Report.pdf
        2. <StudentID>-<StudentIID>-Assignment04-Posting.pdf
    3. Please include the full names of all of your group members as appropriate in the document