Informatics 141: Information Retrieval

Assignment 07

Winter 2009

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine


Due 3/15/2009

  1. Goals:
    1. This assignment is designed to have you:
      1. Use a large posting list to rank web pages against a query.
      2. Implement cosine rank scoring efficiently.
  2. Administration:
    1. You may work in teams of 1, 2, or 3, with no restrictions on membership.
  3. Write a Java Program To Process the Hadoop Output
    1. This part is necessary so that your program runs at a reasonable speed.
    2. The input to this system is the output of the posting list exercise from Assignment 06.
      1. Use the results from the 500,000 URL crawl. Here are the URLs that were crawled.
      2. You can optionally start from our output located here. (820MB compressed, 2.8GB uncompressed)
        1. 000000000 , 12 , 5 : {203645=1, 231518=4, 270201=4, 476437=2, 592873=1}
        2. The term is "000000000"; there are 12 occurrences of that term in 5 documents. The term was found on the Wikipedia pages for "Sinclair_Coefficients" 1 time, "Free_Software_Foundation" 4 times, "Apollonian_gasket" 4 times, "List_of_Sunderland_A.F.C._managers" 2 times, and "Finite_field_arithmetic" 1 time.
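The line format shown above can be parsed along these lines. This is a minimal sketch, not a reference solution: the class name `PostingLine` and its fields are made up for illustration, and it assumes every line follows exactly the `term , occurrences , documents : {docId=count, ...}` layout of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the assignment): one parsed posting-list line.
final class PostingLine {
    final String term;
    final int totalOccurrences;
    final int documentCount;
    final Map<Integer, Integer> postings; // docId -> occurrences in that document

    PostingLine(String term, int totalOccurrences, int documentCount,
                Map<Integer, Integer> postings) {
        this.term = term;
        this.totalOccurrences = totalOccurrences;
        this.documentCount = documentCount;
        this.postings = postings;
    }

    // Assumed layout: "term , total , docs : {docId=count, docId=count}"
    static PostingLine parse(String rawLine) {
        String line = rawLine.trim();
        int sep = line.lastIndexOf(" : {");
        if (sep < 0 || !line.endsWith("}")) {
            throw new IllegalArgumentException("Unexpected line: " + rawLine);
        }
        String head = line.substring(0, sep);
        String body = line.substring(sep + 4, line.length() - 1);

        // Split the head from the right so a term containing " , " still parses.
        int second = head.lastIndexOf(" , ");
        int first = (second > 0) ? head.lastIndexOf(" , ", second - 1) : -1;
        if (first < 0) {
            throw new IllegalArgumentException("Unexpected line: " + rawLine);
        }
        String term = head.substring(0, first);
        int total = Integer.parseInt(head.substring(first + 3, second).trim());
        int docs = Integer.parseInt(head.substring(second + 3).trim());

        Map<Integer, Integer> postings = new LinkedHashMap<>();
        if (!body.isEmpty()) {
            for (String pair : body.split(", ")) {
                int eq = pair.indexOf('=');
                postings.put(Integer.parseInt(pair.substring(0, eq).trim()),
                             Integer.parseInt(pair.substring(eq + 1).trim()));
            }
        }
        return new PostingLine(term, total, docs, postings);
    }

    public static void main(String[] args) {
        PostingLine p = parse(
            "000000000 , 12 , 5 : {203645=1, 231518=4, 270201=4, 476437=2, 592873=1}");
        System.out.println(p.term + " occurs " + p.totalOccurrences + " times in "
            + p.documentCount + " documents: " + p.postings);
    }
}
```

For a 2.8GB input you would read the file line by line (for example with a `BufferedReader`) and convert each line as it arrives rather than holding all parsed lines in memory.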
    3. The output of this program is two binary files.
      1. The first file is a table in which each row has a term and a pointer into the second file.
      2. The second file is a binary representation of the posting list. Use the table in the first file to do random-access lookups into it.
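One possible shape for the two binary files is sketched below. The byte layout, the class name `BinaryIndexSketch`, and the method names are all assumptions chosen for illustration; the assignment does not prescribe them. The dictionary file stores each term with an 8-byte offset, and the postings file stores a count followed by (docId, frequency) pairs, so a lookup is one seek plus one bulk read.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class BinaryIndexSketch {

    // Assumed layout (illustrative only):
    //   dictionary file: repeated [term via writeUTF][8-byte offset]
    //   postings file:   repeated [4-byte count][count x (4-byte docId, 4-byte frequency)]
    static void write(Map<String, Map<Integer, Integer>> index, Path dictFile, Path postFile)
            throws IOException {
        try (DataOutputStream dict = new DataOutputStream(
                 new BufferedOutputStream(Files.newOutputStream(dictFile)));
             DataOutputStream post = new DataOutputStream(
                 new BufferedOutputStream(Files.newOutputStream(postFile)))) {
            long offset = 0; // a long, because the postings file can exceed 2 GB
            for (Map.Entry<String, Map<Integer, Integer>> entry : index.entrySet()) {
                Map<Integer, Integer> postings = entry.getValue();
                dict.writeUTF(entry.getKey());
                dict.writeLong(offset);
                post.writeInt(postings.size());
                for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                    post.writeInt(p.getKey());
                    post.writeInt(p.getValue());
                }
                offset += 4 + 8L * postings.size();
            }
        }
    }

    // Loads the term -> offset table into memory.
    static Map<String, Long> loadDictionary(Path dictFile) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(Files.newInputStream(dictFile)))) {
            while (true) {
                String term;
                try {
                    term = in.readUTF();
                } catch (EOFException endOfFile) {
                    break;
                }
                offsets.put(term, in.readLong());
            }
        }
        return offsets;
    }

    // Random-access read of one posting list; each row is {docId, frequency}.
    static int[][] readPostings(Path postFile, long offset) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(postFile.toFile(), "r")) {
            file.seek(offset);
            int count = file.readInt();
            byte[] raw = new byte[count * 8];
            file.readFully(raw);
            ByteBuffer buffer = ByteBuffer.wrap(raw); // big-endian, like DataOutputStream
            int[][] postings = new int[count][2];
            for (int i = 0; i < count; i++) {
                postings[i][0] = buffer.getInt();
                postings[i][1] = buffer.getInt();
            }
            return postings;
        }
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> postings = new TreeMap<>();
        postings.put(203645, 1);
        postings.put(231518, 4);
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        index.put("000000000", postings);

        Path dict = Files.createTempFile("dictionary", ".bin");
        Path post = Files.createTempFile("postings", ".bin");
        write(index, dict, post);
        long offset = loadDictionary(dict).get("000000000");
        for (int[] row : readPostings(post, offset)) {
            System.out.println("doc " + row[0] + " -> " + row[1]);
        }
        Files.deleteIfExists(dict);
        Files.deleteIfExists(post);
    }
}
```

In a real query program you would probably open the postings file once and keep it open across queries rather than reopening it per lookup, as this sketch does for brevity.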
  4. Write a Java Program To Score a Query
    1. This part calculates the cosine ranking score.
    2. The input to this part is the output of the previous program, plus the document table, plus a query from a user.
    3. The output is a ranked list of the ten most relevant web pages in Wikipedia.
      1. Do not create an accumulator for any term which occurs more than 50,000 times.
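A rough sketch of accumulator-based cosine scoring follows. Several things here are assumptions rather than requirements of the assignment: the class `CosineScorerSketch` and its `TermPostings` type are hypothetical, the weighting is a common tf-idf variant, `(1 + ln tf) * ln(N / df)`, and document vector lengths are assumed to be precomputed and passed in as `docNorms`. The 50,000 limit is read here as "a frequent term may add to accumulators that already exist but may not create new ones"; skipping such terms entirely is another reasonable reading.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class CosineScorerSketch {
    static final int MAX_OCCURRENCES = 50_000;

    // Hypothetical in-memory view of one query term's posting list.
    static final class TermPostings {
        final int[] docIds;
        final int[] freqs;
        final long totalOccurrences;

        TermPostings(int[] docIds, int[] freqs) {
            this.docIds = docIds;
            this.freqs = freqs;
            long sum = 0;
            for (int f : freqs) sum += f;
            this.totalOccurrences = sum;
        }
    }

    // Returns up to k (docId, score) pairs, best first.
    static List<Map.Entry<Integer, Double>> topK(List<TermPostings> queryTerms,
            Map<Integer, Double> docNorms, int totalDocs, int k) {
        // Rare terms first, so they create accumulators before frequent terms run.
        List<TermPostings> ordered = new ArrayList<>(queryTerms);
        ordered.sort(Comparator.comparingLong(t -> t.totalOccurrences));

        Map<Integer, Double> accumulators = new HashMap<>();
        for (TermPostings term : ordered) {
            boolean mayCreate = term.totalOccurrences <= MAX_OCCURRENCES;
            double idf = Math.log((double) totalDocs / term.docIds.length);
            for (int i = 0; i < term.docIds.length; i++) {
                int doc = term.docIds[i];
                if (!mayCreate && !accumulators.containsKey(doc)) continue;
                double docWeight = (1 + Math.log(term.freqs[i])) * idf;
                // Query weight is idf, assuming each query term appears once.
                accumulators.merge(doc, idf * docWeight, Double::sum);
            }
        }

        // Keep only the k best scores with a min-heap.
        Comparator<Map.Entry<Integer, Double>> byScore = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<Integer, Double>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Double> e : accumulators.entrySet()) {
            // A missing norm falls back to 1.0 (no normalization) in this sketch.
            double norm = docNorms.getOrDefault(e.getKey(), 1.0);
            heap.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() / norm));
            if (heap.size() > k) heap.poll();
        }
        List<Map.Entry<Integer, Double>> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<TermPostings> query = new ArrayList<>();
        query.add(new TermPostings(new int[] {1, 2}, new int[] {2, 1}));
        query.add(new TermPostings(new int[] {2}, new int[] {3}));
        for (Map.Entry<Integer, Double> hit : topK(query, new HashMap<>(), 4, 10)) {
            System.out.println("doc " + hit.getKey() + " score " + hit.getValue());
        }
    }
}
```

The query vector's own length is left out because it is the same for every document and so does not change the ranking.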
  5. Extra Credit
    1. Create a web-based user interface which collects a query from a user and displays the results.
      1. This can be either a web page or a browser extension (harder).
  6. What to turn in:
    1. A sources.zip file containing your source code.
    2. A report.txt file containing this information:
      1. Names of your team members.
      2. Size of each on-disk data structure that you are using. For example, if you are using three binary files to store your data structures, this would be a description of each file and its size in megabytes.
      3. Approximate size of your main in-memory data structures. For example, if you are keeping a lookup table in memory, you should report the size (in MB) and a description of this data structure.
      4. A sample query that your program can process quickly and the amount of time it takes to respond.
      5. A sample query that takes longer to process (compared to other typical queries) and the amount of time it takes to respond.
    3. With an efficient implementation, your program should return results in a fraction of a second. We expect a query response to take at most 15 seconds; if your program takes longer than that, something is not working right.
    4. Make sure to attend the discussion session on March 9th. All of the details will be discussed there.
    5. Your programs will be reviewed by Yasser on Monday, March 16th, from 10 am to 12 pm in the ICS third floor lab. If you have a conflict with this time, send him an email (before March 16th) to schedule another time.
  7. Submitting your assignment
    1. Schedule an appointment with Yasser to review your final program.