Computer Science 221: Information Retrieval
Winter 2009-2010
Department of Informatics
Donald Bren School of Information and Computer Sciences
Assignment 05
- Goals:
- This assignment is designed to have you:
- Use a large posting list to rank web pages against a query.
- Implement cosine rank scoring efficiently.
- Administration:
- You may work in teams of 1, 2, or 3; there are no restrictions on membership.
- Write a Java Program to process the Hadoop Output
- This part is necessary so that your program runs at a reasonable speed.
- The input to this system is the output of the posting list exercise from Assignment 04.
- The output of this program is two binary files.
- The first is a table in which each row has a term and a pointer into the second file.
- The second is a binary representation of the posting list; use the first table to do random-access lookups into it.
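One possible layout for the two files is sketched below. Everything about the byte layout is an assumption made for illustration, not a required format: the table file starts with a row count followed by (term, offset) rows, and each posting list is stored as a pair count followed by (document id, term frequency) integers. The class name `IndexFiles` and its methods are hypothetical, and parsing of the Assignment 04 output is left out because its exact line format is not specified here.

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.*;

final class IndexFiles {

    /** Writes both files. postings maps term -> flat array {docId0, tf0, docId1, tf1, ...}. */
    static void write(SortedMap<String, int[]> postings, File dictFile, File postFile)
            throws IOException {
        try (DataOutputStream dict = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(dictFile)));
             DataOutputStream post = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(postFile)))) {
            dict.writeInt(postings.size());      // number of rows in the table
            long offset = 0;                     // tracked by hand so it can exceed 2 GB
            for (Map.Entry<String, int[]> e : postings.entrySet()) {
                int[] list = e.getValue();
                dict.writeUTF(e.getKey());       // term
                dict.writeLong(offset);          // pointer into the postings file
                post.writeInt(list.length / 2);  // number of (docId, tf) pairs
                for (int v : list) {
                    post.writeInt(v);
                }
                offset += 4L + 4L * list.length; // count field plus the list itself
            }
        }
    }

    /** Loads the term -> offset table into memory. */
    static Map<String, Long> readDictionary(File dictFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(dictFile)))) {
            int rows = in.readInt();
            Map<String, Long> table = new HashMap<>();
            for (int i = 0; i < rows; i++) {
                String term = in.readUTF();
                long offset = in.readLong();
                table.put(term, offset);
            }
            return table;
        }
    }

    /** Random-access read of one posting list, returned as {docId0, tf0, ...}. */
    static int[] readPostings(RandomAccessFile postFile, long offset) throws IOException {
        postFile.seek(offset);
        int pairs = postFile.readInt();
        byte[] buf = new byte[pairs * 8];
        postFile.readFully(buf);                       // one bulk read per list
        int[] list = new int[pairs * 2];
        ByteBuffer.wrap(buf).asIntBuffer().get(list);  // big-endian, matching DataOutputStream
        return list;
    }
}
```

With this layout only the table needs to live in memory at query time; each query term costs one seek and one bulk read from the postings file.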
- Write a Java Program to score a query
- This part calculates the cosine ranking score
- The input to this part is the output of the previous program, plus the (document id -> url) table, plus a query from a user.
- The output is a ranked list of the ten most relevant web pages in Wikipedia.
- Do not create an accumulator for any term which occurs more than 50,000 times.
- With an efficient implementation, your program should return results in a fraction of a second. Treat 15 seconds as the maximum acceptable wait for a query response; if your program takes longer than that, something is not working right.
- You can ignore the normalization for the query if you want (this is a common optimization)
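The scoring step above might be sketched as follows. This is a minimal illustration under several assumptions that the assignment does not fix: weights use the common (1 + ln tf) * ln(N / df) scheme, "occurs more than 50,000 times" is read as a posting list with more than 50,000 entries, document vector lengths are precomputed in a hypothetical `docNorm` array, and posting lists are passed in as an in-memory map rather than read from disk. Query normalization is skipped, as the last bullet permits.

```java
import java.util.*;

final class CosineScorer {
    static final int MAX_POSTINGS = 50_000;  // cutoff taken from the assignment text
    static final int TOP_K = 10;

    /**
     * postings: term -> {docId0, tf0, docId1, tf1, ...} (stand-in for lists read from disk).
     * docNorm:  docId -> precomputed document vector length (assumed to exist).
     * Returns up to TOP_K document ids, best first.
     */
    static List<Integer> rank(List<String> query, Map<String, int[]> postings,
                              double[] docNorm, int numDocs) {
        Map<Integer, Double> acc = new HashMap<>();  // one accumulator per candidate document
        for (String term : query) {
            int[] list = postings.get(term);
            if (list == null) continue;              // term not in the index
            int df = list.length / 2;
            if (df > MAX_POSTINGS) continue;         // too common: create no accumulators
            double idf = Math.log((double) numDocs / df);
            for (int i = 0; i < list.length; i += 2) {
                double docWeight = (1.0 + Math.log(list[i + 1])) * idf;
                acc.merge(list[i], docWeight * idf, Double::sum);  // query weight = idf
            }
        }

        // Min-heap of size TOP_K: the weakest of the current best ten sits on top.
        Comparator<Map.Entry<Integer, Double>> byScore = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<Integer, Double>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Double> e : acc.entrySet()) {
            double score = e.getValue() / docNorm[e.getKey()];  // document-side normalization
            heap.add(new AbstractMap.SimpleEntry<>(e.getKey(), score));
            if (heap.size() > TOP_K) heap.poll();
        }

        LinkedList<Integer> ranked = new LinkedList<>();
        while (!heap.isEmpty()) {
            ranked.addFirst(heap.poll().getKey());   // polls ascending, so build in reverse
        }
        return ranked;
    }
}
```

Keeping only a ten-element heap avoids sorting every accumulator, and mapping the returned ids through the (document id -> url) table yields the final result list.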
- Extra Credit
- Create a web-based user interface which collects a query from a user and displays the results.
- Either a web-page or a browser extension (harder)
- What to turn in:
- A sources.zip file containing your source code.
- A report.pdf file containing this information:
- Full names of your team members.
- Size of each on-disk data structure that you are using. For example, if you are using 3 binary files to store your data structures this would be the description of each file and its size in Megabytes.
- Approximate size of your main in-memory data structures. For example, if you are keeping a lookup table in memory, you should report the size (in MB) and a description of this data structure.
- A sample query that your program can process fast and the amount of time it takes to respond.
- A sample query that takes longer to process (compared to other typical queries) and the amount of time it takes to respond.
- Submitting your assignment
- Turn in your files to the dropbox created for this assignment on EEE.
- Please make two documents
- Make the file names:
- <StudentID>-<StudentIID>-Assignment05-Sources.zip
- <StudentID>-<StudentIID>-Assignment05-Report.pdf