Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences

Assignment 01

  1. Digital Photo (30%)
    1. Take a digital photo of your face (approximately 250 by 250 pixels).
    2. Email it with your full name to the instructor.
    3. Evaluation:
      1. Did it show up by the due date?
  2. Take the Get to Know You Survey (30%)
    1. Located here (http://eee.uci.edu/survey/Ze99mpW15t).
    2. It will not work much past the due date.
    3. Make sure you hit submit at the end of the survey.
    4. Evaluation:
      1. Is your survey complete and in the system by the due date?
  3. Java Program (40%)
    1. Write a Java program that takes the text of an HTML page as input and finds the longest palindrome and the longest rhopalic on that page.
      1. What I want is a palindrome made from English words. For our purposes, a palindrome is the longest common substring between a line of text and its reverse. Here are some algorithmic guidelines to help you find it:
        1. First, split the page whenever you see a non-ASCII (>127) character. So an Arabic character, for example, will never be embedded in a palindrome.
        2. Second, strip all punctuation from the text so that only [A-Za-z0-9] remain.
        3. Convert all characters to a single case (all upper or all lower).
        4. Once you have identified the candidate palindromes on a page, make sure that each one:
          1. is longer than 5 characters.
          2. comes from original text that was less than 10% punctuation.
      2. Find the rhopalic with the most words on the page. For our purposes, a rhopalic is a sequence of words in which each word is one character longer than the word before it.
        1. What I want is a rhopalic made from English words. Here are some algorithmic guidelines to help you find that:
          1. First, split the page whenever you see a non-ASCII (>127) character. So an Arabic character, for example, will never be embedded in a rhopalic.
          2. The length and spacing rules are:
            1. The first word has N characters. The second word has N+1 characters, and so on. Consecutive words are separated by at least 1 and no more than 3 whitespace or punctuation characters.
            2. A valid rhopalic then looks like this regular expression:
              1. \b[A-Za-z]{N}\b[\s!@#$%^&*()\-_=+<>,.`~{}\[\]|\\/?]{1,3}\b[A-Za-z]{N+1}\b etc.
              2. Example: "I am the most happy person talking"
    2. Admin:
      1. This project can be done in groups of 2 or fewer people.
    3. Evaluation:
      1. Turn in a Java JAR file that is entirely self-contained and can be run from the command line with a local file as a parameter (e.g., "java -jar myprogram.jar testfile.txt"). I will provide a test file. If your program prints the right answer you get 100%.
      2. Here are the files the programs were evaluated on:
        1. test01, test02, test03
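
The Java Program guidelines above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the class name PageScanner, the method names, and the output format are all hypothetical. It treats each ASCII chunk as one unit of text rather than splitting by line, accepts any ASCII punctuation as a separator (slightly broader than the set listed in the regular expression), and leaves out the 10% punctuation check.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; an illustrative sketch only.
class PageScanner {

    // Alternating runs of ASCII letters (words) and everything else (gaps).
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[^A-Za-z]+");

    // Split the page wherever one or more non-ASCII (> 127) characters appear.
    static String[] asciiChunks(String page) {
        return page.split("[^\\x00-\\x7F]+");
    }

    // Keep only [A-Za-z0-9], lowercase, then expand around every possible center.
    static String longestPalindrome(String chunk) {
        String s = chunk.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
        int bestStart = 0;
        int bestLen = 0;
        for (int center = 0; center < 2 * s.length() - 1; center++) {
            int lo = center / 2;        // even centers sit on a character,
            int hi = lo + center % 2;   // odd centers sit between two characters
            while (lo >= 0 && hi < s.length() && s.charAt(lo) == s.charAt(hi)) {
                lo--;
                hi++;
            }
            int len = hi - lo - 1;
            if (len > bestLen) {
                bestLen = len;
                bestStart = lo + 1;
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }

    // Longest run of words where each word is one character longer than the
    // previous one and the gap between them is 1 to 3 separator characters.
    static List<String> longestRhopalic(String chunk) {
        List<String> best = new ArrayList<>();
        List<String> run = new ArrayList<>();
        boolean gapOk = false;
        Matcher m = TOKEN.matcher(chunk);
        while (m.find()) {
            String tok = m.group();
            char c = tok.charAt(0);
            boolean isWord = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (!isWord) {
                // Assumption: any whitespace or ASCII punctuation is a valid separator.
                gapOk = tok.matches("[\\s\\p{Punct}]{1,3}");
                continue;
            }
            boolean continuesRun = gapOk && !run.isEmpty()
                    && tok.length() == run.get(run.size() - 1).length() + 1;
            if (!continuesRun) {
                run = new ArrayList<>();
            }
            run.add(tok);
            if (run.size() > best.size()) {
                best = new ArrayList<>(run);
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java -jar myprogram.jar <file>");
            return;
        }
        String page = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String bestPalindrome = "";
        List<String> bestRhopalic = new ArrayList<>();
        for (String chunk : asciiChunks(page)) {
            String p = longestPalindrome(chunk);
            // Only palindromes longer than 5 characters qualify.
            if (p.length() > 5 && p.length() > bestPalindrome.length()) {
                bestPalindrome = p;
            }
            List<String> r = longestRhopalic(chunk);
            if (r.size() > bestRhopalic.size()) {
                bestRhopalic = r;
            }
        }
        System.out.println("Longest palindrome: " + bestPalindrome);
        System.out.println("Longest rhopalic: " + String.join(" ", bestRhopalic));
    }
}
```

Expanding around each center takes quadratic time in the worst case, which should be acceptable for a single page. To satisfy the evaluation step, the compiled class would still need to be packaged into a JAR whose manifest names it as the main class.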