Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences

Assignment 05

  1. Goals:
    1. This assignment is designed to have you:
      1. Use a large posting list to rank web pages against a query.
      2. Implement cosine rank scoring efficiently.
  2. Administration:
    1. You may work in teams of 1, 2 or 3, no restrictions on membership.
  3. Write a Java program to process the Hadoop output
    1. This part is necessary so that your program runs at a reasonable speed.
    2. The input to this system is the output of the posting list exercise from Assignment 04.
    3. The output of this program is two binary files.
      1. The first file
        1. Is a table in which each row has a term and a pointer into the second file.
      2. The second file
        1. Is a binary representation of the posting list. Use the table in the first file to do random-access lookups into it.
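One possible layout for the two files is sketched below. It is only an illustration under stated assumptions, not the required format: the class name `PostingIndexSketch`, the in-memory input shape (term mapped to a flat array of document id and term frequency pairs), and the on-disk record layout are all hypothetical choices. The table stores each term with the byte offset of its posting list, and the lookup seeks straight to that offset.

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.*;

class PostingIndexSketch {

    // Writes both files. Assumed input shape: term -> flat array
    // [docId0, tf0, docId1, tf1, ...], already sorted by term.
    static void write(SortedMap<String, int[]> postings, File tableFile, File postingsFile)
            throws IOException {
        try (DataOutputStream table = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(tableFile)));
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(postingsFile)))) {
            table.writeInt(postings.size());      // number of rows in the table
            long offset = 0;                      // tracked as a long so files over 2 GB still work
            for (Map.Entry<String, int[]> e : postings.entrySet()) {
                int[] flat = e.getValue();
                table.writeUTF(e.getKey());       // term
                table.writeLong(offset);          // pointer into the postings file
                out.writeInt(flat.length / 2);    // number of (docId, tf) pairs
                for (int v : flat) {
                    out.writeInt(v);
                }
                offset += 4L + 4L * flat.length;  // count field plus the pairs
            }
        }
    }

    // Loads the term table into memory: term -> byte offset.
    static Map<String, Long> loadTable(File tableFile) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(tableFile)))) {
            int rows = in.readInt();
            for (int i = 0; i < rows; i++) {
                String term = in.readUTF();
                long offset = in.readLong();
                offsets.put(term, offset);
            }
        }
        return offsets;
    }

    // Random-access read of one posting list, returned as [docId0, tf0, ...].
    static int[] lookup(RandomAccessFile postings, long offset) throws IOException {
        postings.seek(offset);
        int count = postings.readInt();
        byte[] raw = new byte[count * 8];
        postings.readFully(raw);                  // one bulk read instead of many small ones
        int[] flat = new int[count * 2];
        ByteBuffer.wrap(raw).asIntBuffer().get(flat);  // big-endian, matching DataOutputStream
        return flat;
    }
}
```

Keeping the table in memory and the posting lists on disk means a query touches only the lists for its own terms. Reading the real Assignment 04 output would replace the in-memory map assumed here.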
  4. Write a Java program to score a query
    1. This part calculates the cosine ranking score.
    2. The input to this part is the output of the previous program, plus the (document id -> url) table, plus a query from a user.
    3. The output is a ranked list of the ten most relevant web pages in Wikipedia.
      1. Do not create an accumulator for any term which occurs more than 50,000 times.
      2. With an efficient implementation, your program should return results in a fraction of a second, so expect to wait at most 15 seconds for a query response. If your program takes longer than that, something is not working right.
      3. You can ignore the normalization for the query if you want (this is a common optimization).
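The scoring step above can be sketched with per-document accumulators. This is a minimal illustration, not the required implementation, and it rests on several assumptions: the class `CosineScorerSketch` and its `Hit` type are hypothetical, weights use a plain tf x idf scheme with idf = ln(N / df), "occurs more than 50,000 times" is read as "has more than 50,000 postings", and the document vector lengths in `docNorm` are assumed to have been precomputed. Query normalization is skipped, as the assignment permits.

```java
import java.util.*;

class CosineScorerSketch {
    static final int MAX_POSTINGS = 50_000;  // terms above this get no accumulators

    static class Hit {
        final int docId;
        final double score;
        Hit(int docId, double score) { this.docId = docId; this.score = score; }
    }

    // postingsByTerm: term -> flat [docId0, tf0, docId1, tf1, ...] (assumed shape).
    // docNorm: docId -> precomputed length of the document vector (assumed available).
    static List<Hit> topK(List<String> queryTerms, Map<String, int[]> postingsByTerm,
                          Map<Integer, Double> docNorm, int totalDocs, int k) {
        Map<Integer, Double> acc = new HashMap<>();
        for (String term : queryTerms) {
            int[] flat = postingsByTerm.get(term);
            if (flat == null) continue;            // term not in the index
            int df = flat.length / 2;
            if (df > MAX_POSTINGS) continue;       // too common: skip entirely
            double idf = Math.log((double) totalDocs / df);
            for (int i = 0; i < flat.length; i += 2) {
                double docWeight = flat[i + 1] * idf;            // tf * idf
                acc.merge(flat[i], idf * docWeight, Double::sum); // query weight is idf
            }
        }
        // Divide by document length and keep the k best in a min-heap.
        PriorityQueue<Hit> heap =
            new PriorityQueue<Hit>((a, b) -> Double.compare(a.score, b.score));
        for (Map.Entry<Integer, Double> e : acc.entrySet()) {
            double norm = docNorm.getOrDefault(e.getKey(), 1.0);
            heap.add(new Hit(e.getKey(), e.getValue() / norm));
            if (heap.size() > k) heap.poll();      // drop the current lowest score
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort((a, b) -> Double.compare(b.score, a.score)); // best first
        return result;
    }
}
```

Calling `topK(terms, postings, norms, totalDocs, 10)` would return the ten best document ids, which can then be mapped to URLs through the (document id -> url) table. The heap keeps selection at O(n log k) rather than sorting every accumulator.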
  5. Extra Credit
    1. Create a web-based user interface which collects a query from a user and displays the results.
      1. This can be either a web page or a browser extension (harder).
  6. What to turn in:
    1. A sources.zip file containing your source code.
    2. A report.pdf file containing this information:
      1. Full names of your team members.
      2. Size of each on-disk data structure that you are using. For example, if you are using three binary files to store your data structures, give a description of each file and its size in megabytes.
      3. Approximate size of your main in-memory data structures. For example, if you are keeping a lookup table in memory, you should report the size (in MB) and a description of this data structure.
      4. A sample query that your program can process fast and the amount of time it takes to respond.
      5. A sample query that takes longer to process (compared to other typical queries) and the amount of time it takes to respond.
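The numbers for the report can be gathered with a few standard-library calls. The helper below is a rough sketch; the class name `ReportMetricsSketch` is hypothetical, a megabyte is assumed to mean 1024 x 1024 bytes, and the heap figure is only an approximation of the size of in-memory structures because it includes everything else the JVM has allocated.

```java
import java.io.File;

class ReportMetricsSketch {

    // Size of an on-disk data structure, assuming 1 MB = 1024 * 1024 bytes.
    static double sizeInMegabytes(File f) {
        return f.length() / (1024.0 * 1024.0);
    }

    // Wall-clock time of one query, in milliseconds.
    static double elapsedMillis(Runnable query) {
        long start = System.nanoTime();
        query.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    // Rough heap usage; compare readings taken before and after loading a structure.
    static double usedHeapMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024.0 * 1024.0);
    }
}
```

Timing the same query several times and reporting a typical value helps avoid counting one-time costs such as JIT warm-up or a cold disk cache.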
  7. Submitting your assignment
    1. Turn in your report to the dropbox created for this assignment on EEE.
    2. Please submit two files
      1. Name the files:
        1. <StudentID>-<StudentID>-Assignment05-Sources.zip
        2. <StudentID>-<StudentID>-Assignment05-Report.pdf