Informatics 141 / Computer Science 121: Information Retrieval

Assignment 06

Winter 2008

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine

Due 03/03/2008

  1. Build a postings list
    1. Create a postings list that will work for the cosine similarity algorithm.
    2. The input is the output of your Wikipedia crawl:
      1. One file of (document, term) pairs
      2. One file of (document, imageUrl) pairs
    3. The first output is a postings list with metadata:
      1. The head of every list should have the term for that list, plus the document frequency for that term
      2. Every entry in the list should be a document that has that term, plus the term frequency of that term in that document
    4. The second output is a simple lookup table of documents and all the images in that document
  2. The challenge
    1. This assignment is hard because of the scale. Use as much data as you can.
  3. Resources
    1. If you didn't get your crawler to work, you may build your postings list from this common set of files:
      1. Here is a sample to get you going.
        1. Sample crawl
        2. Data courtesy of Kyle S. (henceforth the crawl-czar)
        3. Format of the (document, term) pairs file:
          1. ##NEWURL:<DocumentURL1>
          2. <Term1>:<CountInDocumentURL1>
          3. <Term2>:<CountInDocumentURL1>
          4. <Term3>:<CountInDocumentURL1>
          5. ##NEWURL:<DocumentURL2>
          6. <Term1>:<CountInDocumentURL2>
          7. <Term2>:<CountInDocumentURL2>
          8. <Term3>:<CountInDocumentURL2>
          9. etc...
        4. Format of the (document, image) pairs file:
          1. ##NEWURL:<DocumentURL1>
          2. <ImageURL1>
          3. <ImageURL2>
          4. etc...
      2. Full data set (about 70,000 Wikipedia pages)
        1. Full data set: djpatter_terms.zip, djpatter_terms2.zip, djpatter_output.zip
  4. Evaluation
    1. In person (Do this with a number of examples) (85%)
      1. Precision:
        1. If we give you a word, can you give us a list of documents that contain that word?
        2. If we give you a document, can you give us an image in that document?
      2. Recall:
        1. If we give you a word, can you give us a document that we know has that word in it?
        2. If we give you a document, can you give us an image that we know is in the document?
    2. Were you able to use your own crawl data to do this or did you use the common data? (15% for your own crawl)
  5. This may optionally be done as a group project, in groups of 2 only
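
The two outputs described in item 1 can be sketched in Python as follows. This is a minimal in-memory sketch, not a reference solution. It assumes the angle brackets in the format description are placeholders rather than literal characters, and that the count follows the last colon on a term line. The function names (parse_records, build_postings, build_image_table, postings_line) and the tab-separated output layout are hypothetical choices made for illustration; the assignment does not prescribe them.

```python
from collections import defaultdict

NEW_URL = "##NEWURL:"  # record separator used in both input files


def parse_records(lines):
    """Yield (document_url, payload_line) pairs from a ##NEWURL-delimited file."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(NEW_URL):
            current = line[len(NEW_URL):]
        elif current is not None:
            yield current, line


def build_postings(term_lines):
    """Return {term: {document: term frequency}}.

    The document frequency of a term is len(postings[term]).
    """
    postings = defaultdict(dict)
    for doc, line in parse_records(term_lines):
        # Assumption: the count is whatever follows the LAST colon,
        # so a term that itself contains a colon is kept whole.
        term, _, count = line.rpartition(":")
        if not term or not count.isdigit():
            continue  # skip malformed lines
        postings[term][doc] = postings[term].get(doc, 0) + int(count)
    return dict(postings)


def build_image_table(image_lines):
    """Return {document: [image URLs in that document]}."""
    images = defaultdict(list)
    for doc, url in parse_records(image_lines):
        images[doc].append(url)
    return dict(images)


def postings_line(term, docs):
    """Serialize one list: term and document frequency at the head,
    then one (document, term frequency) entry per document.
    The tab-separated layout is an assumption of this sketch."""
    head = f"{term}\t{len(docs)}"
    entries = [f"{doc}\t{tf}" for doc, tf in sorted(docs.items())]
    return "\t".join([head] + entries)
```

With these helpers, the evaluation questions become dictionary lookups: postings.get(word, {}) gives the documents containing a word, and images.get(document, []) gives the images in a document. Holding everything in memory may not be feasible for the full data set, so a larger run would likely need to write partial lists to disk and merge them.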