Winter 2009: Informatics 141 : Information Retrieval : Assignment 02

Goals:
1. This assignment is designed to prepare you for the rest of the class:
  1. To set up a programming environment in openlab.
  2. To be able to run a program in an environment with enough resources for building a search engine (disk space and CPU cycles).
2. To write feature detectors for things you might want to detect on the web and make available for users on the web.
3. To encourage you to use a modular architecture by forcing development of components apart from the future infrastructure.
Java Program (yes Java)
1. This assignment may be done in groups of 1, 2 or 3.
2. Write a Java program that runs on an openlab machine in a Unix environment.
3. It should be able to access "/extra/ugrad_space" as per these instructions.
4. Given a text file your program should:
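  1. Find the longest palindrome: a palindrome is a sequence of text that reads the same forwards and backwards. The longest common substring between a line of text and its reverse is a useful candidate, but it must still be checked, because such a substring is not always a palindrome.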
    1. What I want is a palindrome made from English words. Here are some algorithmic guidelines to help you find that:
      1. First, split the text at every non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a palindrome.
      2. Second, strip all punctuation, spaces, symbols and numbers from the text so that only [A-Za-z] remain.
      3. Third, convert all characters to a single case (all upper or all lower).
      4. Once you identify all palindromes on a page over X characters, make sure that each one meets both conditions:
        1. It is longer than 5 characters.
        2. Less than 30% of the original text that it came from was punctuation.
  2. Find the longest lipogram (letter "E"/"e"): a lipogram is a sequence of text that doesn't contain a particular letter, here "E" or "e".
    1. What I want is a lipogram made from English words. Here are some algorithmic guidelines to help you find that:
      1. First, split the text at every non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a lipogram.
      2. Second, strip all punctuation, spaces, symbols and numbers from the text so that only [A-Za-z] remain.
      3. Once you identify all lipograms on a page over X characters, make sure that:
        1. Less than 30% of the original text was punctuation.
  3. Find the rhopalic with the greatest number of words.
    1. For our purposes, a rhopalic is a sequence of words in which each word is one character longer than the word before it.
      1. The first word has N characters. The second word has N+1 characters.
      2. Words are separated by at least 1 and no more than 3 characters of white space or punctuation.
      3. Example: "I am the most happy person talking"
    2. What I want is a rhopalic made from English words. Here are some algorithmic guidelines to help you find that:
      1. First, split the text at every non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a rhopalic.
  4. In measuring the length of strings, we count all the characters, including spaces and punctuation.
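The guidelines above can be sketched roughly as follows. This is only one possible reading of them, not a reference solution: the class name FeatureSketch and all method names are hypothetical, "punctuation" is assumed to mean any non-letter character, the 30% rule is applied to the original span that a candidate came from, and any 1 to 3 non-letter characters are accepted as a word separator. The palindrome search expands around each center and checks only the widest match at that center, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper class; every name here is illustrative. */
public class FeatureSketch {

    /** Splits text at every run of non-ASCII (> 127) characters. */
    static List<String> asciiSegments(String text) {
        List<String> segments = new ArrayList<>();
        for (String s : text.split("[^\\x00-\\x7F]+")) {
            if (!s.isEmpty()) segments.add(s);
        }
        return segments;
    }

    /** Assumed reading of the 30% rule: strictly fewer than 30% non-letters. */
    static boolean mostlyLetters(String s) {
        int nonLetters = s.replaceAll("[A-Za-z]", "").length();
        return 10 * nonLetters < 3 * s.length();
    }

    /** Longest palindrome in one segment, returned as the original text span, or "". */
    static String longestPalindrome(String seg) {
        StringBuilder letters = new StringBuilder(); // lower-cased letters only
        List<Integer> pos = new ArrayList<>();       // pos.get(i): index in seg of letter i
        for (int i = 0; i < seg.length(); i++) {
            char c = seg.charAt(i);
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                letters.append(Character.toLowerCase(c));
                pos.add(i);
            }
        }
        String best = "";
        int n = letters.length();
        for (int center = 0; center < 2 * n - 1; center++) {
            int lo = center / 2, hi = lo + center % 2; // odd- and even-length centers
            while (lo >= 0 && hi < n && letters.charAt(lo) == letters.charAt(hi)) {
                lo--;
                hi++;
            }
            lo++;
            hi--; // step back to the last matching pair
            if (hi < lo) continue; // even center whose two letters differ
            // Length is measured on the original text, spaces and punctuation included.
            String span = seg.substring(pos.get(lo), pos.get(hi) + 1);
            if (span.length() > 5 && mostlyLetters(span) && span.length() > best.length()) {
                best = span;
            }
        }
        return best;
    }

    /** Longest run without 'e' or 'E' in one segment, trimmed to its outer letters, or "". */
    static String longestLipogram(String seg) {
        String best = "";
        for (String run : seg.split("[eE]")) {
            String trimmed = run.replaceAll("^[^A-Za-z]+|[^A-Za-z]+$", "");
            if (mostlyLetters(trimmed) && trimmed.length() > best.length()) {
                best = trimmed;
            }
        }
        return best;
    }

    /** Rhopalic with the most words in one segment (at least two words), or "". */
    static String longestRhopalic(String seg) {
        Matcher word = Pattern.compile("[A-Za-z]+").matcher(seg);
        String best = "";
        int bestWords = 1, words = 0, chainStart = 0, prevEnd = 0, prevLen = 0;
        while (word.find()) {
            int len = word.end() - word.start();
            // Assumption: any 1 to 3 non-letter characters count as a separator.
            boolean extendsChain = words > 0 && len == prevLen + 1
                    && word.start() - prevEnd <= 3;
            if (extendsChain) {
                words++;
            } else {
                chainStart = word.start();
                words = 1;
            }
            if (words > bestWords) {
                bestWords = words;
                best = seg.substring(chainStart, word.end());
            }
            prevEnd = word.end();
            prevLen = len;
        }
        return best;
    }
}
```

For a whole file, one would call each detector on every segment returned by asciiSegments and keep the longest result, or for the rhopalic the result with the most words. Keeping the index of each letter in the original segment is what lets the palindrome be reported and measured with its original spaces and punctuation.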
Submitting your assignment
1. We are going to use checkmate.ics.uci.edu to submit this assignment.
2. Make the file name <StudentID>-<StudentID>-<StudentID>-Assignment02.jar
3. The jar file should include a package named "ir.assignment02"
4. This package should include a class named "Analyzer" which contains the "main" function.
5. We will execute the program with this command: java -cp <filename>.jar ir.assignment02.Analyzer
6. The program should read its input from the file "in.txt", which is in the same folder as the jar file.
7. The program should write its output to the file "out.txt", which should be stored in the same folder as the jar file.
8. Attached are sample in.txt and out.txt files. The generated output file should exactly match the sample.
  1. Input file
  2. Output file
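A minimal sketch of the entry point these submission rules describe, under some assumptions: the program is started from the folder that holds the jar, so the relative names "in.txt" and "out.txt" resolve there; the input is UTF-8; and the analyze method is a hypothetical placeholder, since the required output layout is defined by the sample out.txt and is not reproduced here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// In the submitted jar this class belongs to the package ir.assignment02,
// so the real source file would begin with: package ir.assignment02;
public class Analyzer {

    /**
     * Hypothetical placeholder. The real version would run the three detectors
     * and format the results exactly like the sample out.txt.
     */
    static String analyze(String text) {
        return "characters read: " + text.length() + System.lineSeparator();
    }

    /** Reads the whole input file, analyzes it, and writes the report. */
    static void run(Path in, Path out) throws IOException {
        // Assumption: the input is UTF-8 encoded.
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, analyze(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Relative paths resolve against the current working directory,
        // assumed here to be the folder that contains the jar.
        run(Paths.get("in.txt"), Paths.get("out.txt"));
    }
}
```

Keeping the file handling in run and the text processing in analyze means the detectors can be tested on plain strings without touching the file system.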
Evaluation:
1. Can the TA run the version that you turned in?
  1. Is it runnable?
  2. Did you follow instructions?
2. Does it do what it is supposed to do?
  1. Output the longest palindrome, lipogram and rhopalic on test files provided by the TA?
    1. Does it output a valid {p,l,r} that is in the text?
    2. Is it the longest one in the text?
    3. Here are the files we tested on.