Computer Science 221: Information Retrieval
Winter 2009-2010
Department of Informatics
Donald Bren School of Information and Computer Sciences
Assignment 01
- Digital Photo (30%)
- Take a digital photo of your face (approximately 250 by 250 pixels).
- Email it with your full name to the instructor.
- Evaluation:
- Did it show up by the due date?
- Take the Get to Know You Survey (30%)
- Located here (http://eee.uci.edu/survey/Ze99mpW15t).
- The survey will stop working shortly after the due date.
- Make sure you hit submit at the end of the survey.
- Evaluation:
- Is your survey complete and in the system by the due date?
- Java Program (40%)
- Write a Java program that takes the text of an HTML page as input and finds the longest palindrome and the longest rhopalic on that page.
- What I want is a palindrome made from English words. For this assignment, a palindrome is the longest common substring of a line of text and its reverse. Here are some algorithmic guidelines to help you find it:
- First, split the page wherever you see a non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a palindrome.
- Second, strip all punctuation from the text so that only [A-Za-z0-9] characters remain.
- Third, convert all characters to a single case (upper or lower).
- Once you have identified all palindromes on a page over X characters long, make sure that, for each one:
- it is longer than 5 characters.
- less than 10% of its original text was punctuation.
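The palindrome guidelines above can be sketched in Java roughly as follows. This is a minimal sketch, not a reference solution. The class and method names (PalindromeSketch, longestPalindrome, longestInSegment) are hypothetical. Instead of the longest-common-substring formulation, it uses an expand-around-center search, which finds palindromic substrings directly. It also assumes that whitespace does not count as punctuation and that the 10% limit applies to the span of original text covered by the palindrome.

```java
class PalindromeSketch {
    // From the guidelines: a palindrome must be longer than 5 characters.
    static final int MIN_LENGTH = 5;

    /** Returns the longest qualifying palindrome (lower case, alphanumeric only), or "" if none. */
    static String longestPalindrome(String page) {
        String best = "";
        // Step 1: split wherever a non-ASCII (>127) character appears.
        for (String segment : page.split("[^\\x00-\\x7F]+")) {
            String candidate = longestInSegment(segment);
            if (candidate.length() > best.length()) {
                best = candidate;
            }
        }
        return best;
    }

    static String longestInSegment(String segment) {
        int n = segment.length();
        StringBuilder norm = new StringBuilder();
        int[] origIndex = new int[n];       // origIndex[k] = position in segment of norm's k-th char
        int[] punctBefore = new int[n + 1]; // punctBefore[i] = punctuation count in segment[0..i)
        for (int i = 0; i < n; i++) {
            char c = segment.charAt(i);
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum) {
                // Steps 2 and 3: keep only [A-Za-z0-9], converted to lower case.
                origIndex[norm.length()] = i;
                norm.append(Character.toLowerCase(c));
            }
            // Assumption: whitespace is not counted as punctuation.
            boolean punct = !alnum && !Character.isWhitespace(c);
            punctBefore[i + 1] = punctBefore[i] + (punct ? 1 : 0);
        }

        String best = "";
        int m = norm.length();
        // Expand around each of the 2m-1 centers (on a character or between two characters).
        for (int center = 0; center < 2 * m - 1; center++) {
            int lo = center / 2;
            int hi = lo + center % 2;
            while (lo >= 0 && hi < m && norm.charAt(lo) == norm.charAt(hi)) {
                int len = hi - lo + 1;
                if (len > MIN_LENGTH && len > best.length()) {
                    int start = origIndex[lo];
                    int end = origIndex[hi];
                    int punct = punctBefore[end + 1] - punctBefore[start];
                    // Keep it only if under 10% of the original span was punctuation.
                    if (punct * 10 < end - start + 1) {
                        best = norm.substring(lo, hi + 1);
                    }
                }
                lo--;
                hi++;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestPalindrome("Was it a car or a cat I saw?"));
    }
}
```

Because every expansion step is checked, a shorter inner palindrome can still qualify when the longest one around the same center fails the punctuation limit.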
- Find the rhopalic with the largest number of words on the page. For our purposes, a rhopalic is a sequence of words in which each word is one character longer than the word before it.
- What I want is a rhopalic made from English words. Here are some algorithmic guidelines to help you find that:
- First, split the page wherever you see a non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a rhopalic.
- The first word has N characters and the second word has N+1 characters. Words are separated by at least 1 and no more than 3 characters of whitespace or punctuation.
- A valid rhopalic then looks like this regular expression:
- \b[A-Za-z]{N}\b[\s!@#$%^&*()\-_=+<>,.`~{}\[\]|\\/?]{1,3}\b[A-Za-z]{N+1}\b etc.
- Example: "I am the most happy person talking" (word lengths 1 through 7)
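The rhopalic rules above could be sketched in Java along these lines. This is only a sketch under stated assumptions. The class and method names (RhopalicSketch, longestRhopalic) are hypothetical. Words are taken to be maximal runs of [A-Za-z] rather than matched with \b. The separator set is the one listed in the regular expression, with the hyphen escaped. A rhopalic is assumed to need at least two words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RhopalicSketch {
    // A word is a maximal run of ASCII letters (assumption; the spec uses \b boundaries).
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");
    // 1 to 3 separator characters taken from the assignment's list.
    private static final Pattern SEPARATOR =
            Pattern.compile("[\\s!@#$%^&*()\\-_=+<>,.`~{}\\[\\]|\\\\/?]{1,3}");

    /** Returns the original text of the rhopalic with the most words, or "" if none has 2+ words. */
    static String longestRhopalic(String page) {
        String best = "";
        int bestWords = 1; // assumption: a single word does not count as a rhopalic
        // Split wherever a non-ASCII (>127) character appears.
        for (String segment : page.split("[^\\x00-\\x7F]+")) {
            Matcher m = WORD.matcher(segment);
            int chainStart = 0; // offset of the first word in the current chain
            int chainWords = 0; // number of words in the current chain
            int prevEnd = 0;    // end offset of the previous word
            int prevLen = 0;    // length of the previous word
            while (m.find()) {
                int len = m.end() - m.start();
                // The chain grows only if this word is exactly one character longer
                // and the gap before it is 1 to 3 allowed separator characters.
                boolean extend = chainWords > 0
                        && len == prevLen + 1
                        && SEPARATOR.matcher(segment.substring(prevEnd, m.start())).matches();
                if (extend) {
                    chainWords++;
                } else {
                    chainStart = m.start();
                    chainWords = 1;
                }
                if (chainWords > bestWords) {
                    bestWords = chainWords;
                    best = segment.substring(chainStart, m.end());
                }
                prevEnd = m.end();
                prevLen = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRhopalic("I am the most happy person talking"));
    }
}
```

A single left-to-right pass is enough here because the rule only relates each word to the one immediately before it, so a chain either grows by one word or restarts at the current word.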
- Admin:
- This project can be done in groups of up to 2 people.
- Evaluation:
- Turn in a Java JAR file that is entirely self-contained and can be run from the command line with a local file as a parameter (e.g., "java -jar myprogram.jar testfile.txt"). I will provide a test file. If your program prints the right answer, you get 100%.
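A command-line entry point for such a JAR might look roughly like the sketch below. The class name PageReaderSketch and the helper decode are hypothetical, and the choice of ISO-8859-1 is an assumption, not something the assignment specifies. It is used here because it maps every byte to exactly one character, so any byte above 127 becomes a character above 127 and can still serve as a split point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class PageReaderSketch {
    /** Decodes raw bytes one-to-one into characters (assumed encoding: ISO-8859-1). */
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java -jar myprogram.jar <file>");
            return;
        }
        // Read the local file named on the command line.
        String page = decode(Files.readAllBytes(Paths.get(args[0])));
        // The palindrome and rhopalic searches would run on `page` here.
        System.out.println("Read " + page.length() + " characters");
    }
}
```

For "java -jar" to work, the JAR manifest needs a Main-Class entry. One way to set it is "jar cfe myprogram.jar PageReaderSketch *.class", where the class name is whatever your entry point is called.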
- Here are the files the programs were evaluated on:
- test01, test02, test03