INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015

Assignment #2
Text Processing Functions

This assignment is to be done individually; you may not use code written by your classmates. Use code found on the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. As stated in the collaboration guidelines, concealing the origin of a piece of code is plagiarism. Use Piazza for general questions whose answers can benefit you and everyone.

General Specifications

  1. You may use Java, Python, or Scheme/Racket for this assignment. Java is the safest choice because the assignment is written with Java in mind and contains a variety of helpful Java resources. Using Python or Scheme will require you to translate these Java resources.
    1. If you use Java, your solution must fill out the program skeleton provided. (i) Fill in each method according to its Javadoc specification. (ii) Feel free to create additional methods / classes where necessary.
    2. If you don’t use Java, you should produce a similar skeleton to start with and fill it out. You should also be very precise with instructions for how to run your program -- what programs are needed, what versions, and so on. If the TA can’t run your program, your grade will reflect that.
  2. You should test your code thoroughly, of course, with test data you create. You may exchange test data with anyone in the class. We will test your program with our own text files.
  3. At points, this assignment may be underspecified (i.e., not fully describe what to do in every situation). In those cases, post your questions on Piazza or check with the TA. For minor issues, make your own assumptions and document them.

Project Skeleton: http://www.ics.uci.edu/~kay/courses/i141/hw/Assignment2.zip

Part A: Utilities (20 points)
Write a method that reads in a text file and returns a list of the tokens in that file. Write a method to print out frequency results.
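
As a rough illustration of the two utilities, here is a minimal sketch, assuming Java 8 or later. The class and method names (UtilitiesSketch, tokenize, tokenizeFile, formatFrequencies, printFrequencies) are hypothetical and are not taken from the provided skeleton, whose Javadoc takes precedence. The sketch also assumes that a token is a lowercase run of ASCII letters and digits, and that results are printed most frequent first; both are assumptions you would need to document.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; names are illustrative, not from the skeleton.
class UtilitiesSketch {

    // Splits text into lowercase tokens. ASSUMPTION: any run of characters
    // other than ASCII letters and digits separates two tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {   // split() can yield a leading empty string
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Reads the whole file as UTF-8 (an assumption) and tokenizes it.
    static List<String> tokenizeFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return tokenize(text);
    }

    // Formats one "item<TAB>count" line per entry, highest count first,
    // breaking ties alphabetically so the output is deterministic.
    static String formatFrequencies(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : entries) {
            out.append(entry.getKey()).append('\t')
               .append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    // Prints the formatted frequency table to standard output.
    static void printFrequencies(Map<String, Integer> counts) {
        System.out.print(formatFrequencies(counts));
    }
}
```

Keeping the formatting separate from the printing makes the output easy to check in a test without capturing standard output.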

Part B: Word Frequencies (20 points)
Count the total number of words and their frequencies in a token list. 
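
One possible way to do the counting is sketched below, assuming Java 8 or later and a token list that has already been normalized (for example, lowercased). The names WordFrequencySketch, countWords, and totalWords are hypothetical; the skeleton's own method signatures should be followed where they differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative, not from the skeleton.
class WordFrequencySketch {

    // Maps each distinct token to the number of times it appears.
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            // merge() stores 1 for a new key, otherwise adds 1 to the old value
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // The total number of words is the sum of all the per-word counts,
    // which equals the length of the original token list.
    static int totalWords(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total;
    }
}
```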

Part C: 2-grams (30 points)
A 2-gram is two words that occur consecutively in a file. For example, "two words", "words that", and "that occur" are all 2-grams from the previous sentence.

Count the total number of 2-grams and their frequencies in a token list.
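
The definition above can be sketched as a single pass over adjacent token pairs. This is a minimal sketch, assuming Java 8 or later; the names TwoGramSketch, countTwoGrams, and totalTwoGrams are hypothetical, and joining the two words with a single space is an assumption about how a 2-gram should be represented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative, not from the skeleton.
class TwoGramSketch {

    // Counts every pair of consecutive tokens. ASSUMPTION: a 2-gram is
    // stored as the two words joined by one space.
    static Map<String, Integer> countTwoGrams(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String twoGram = tokens.get(i) + " " + tokens.get(i + 1);
            counts.merge(twoGram, 1, Integer::sum);
        }
        return counts;
    }

    // A list of n tokens contains n - 1 two-grams (and none if n < 2).
    static int totalTwoGrams(List<String> tokens) {
        return Math.max(0, tokens.size() - 1);
    }
}
```

For the token list "two", "words", "that", "two", "words", this sketch counts "two words" twice and "words that" and "that two" once each, for four 2-grams in total.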

Part D: Palindromes (30 points)
A palindrome is a word or phrase that reads the same in both directions. For example, these are all palindromes: "kayak", "Do geese see god", "A man, a plan, a canal--Panama". Count the total number of palindromes and their frequencies in a text file.
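
The examples above suggest that case, spaces, and punctuation are ignored when deciding whether a phrase is a palindrome. A minimal sketch of such a check follows, under that assumption; the names PalindromeCheckSketch and isPalindrome are hypothetical and not part of the skeleton. Note that this sketch treats a string with no letters or digits as a palindrome, which you may want to rule out.

```java
// Hypothetical helper class; names are illustrative, not from the skeleton.
class PalindromeCheckSketch {

    // Two-pointer check that compares letters and digits only, ignoring
    // case (ASSUMPTION: punctuation and whitespace do not count).
    static boolean isPalindrome(String text) {
        int left = 0;
        int right = text.length() - 1;
        while (left < right) {
            char l = text.charAt(left);
            char r = text.charAt(right);
            if (!Character.isLetterOrDigit(l)) {
                left++;                       // skip punctuation on the left
            } else if (!Character.isLetterOrDigit(r)) {
                right--;                      // skip punctuation on the right
            } else {
                if (Character.toLowerCase(l) != Character.toLowerCase(r)) {
                    return false;             // mismatch: not a palindrome
                }
                left++;
                right--;
            }
        }
        return true;
    }
}
```

Each character is visited at most once, so a single check takes time proportional to the length of the string.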

Once you have implemented your palindrome counting algorithm, please perform a short analysis of its runtime complexity: Does it run in linear, polynomial, or exponential time relative to the size of the input? This analysis should go in the analysis.txt file in this package.
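
To make the kind of analysis concrete, here is one brute-force counting sketch; it is not the required algorithm, and your own approach may differ. It assumes Java 8 or later, a token list of lowercase alphanumeric words, that a palindrome is any contiguous run of tokens whose concatenated letters read the same in both directions, and that a palindrome needs at least two letters. The names PalindromeCountSketch, isPalindrome, and countPalindromes are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative, not from the skeleton.
class PalindromeCountSketch {

    // Plain two-pointer check on already-normalized characters.
    static boolean isPalindrome(CharSequence s) {
        for (int i = 0, j = s.length() - 1; i < j; i++, j--) {
            if (s.charAt(i) != s.charAt(j)) {
                return false;
            }
        }
        return true;
    }

    // Tries every contiguous run of tokens as a candidate phrase.
    static Map<String, Integer> countPalindromes(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder letters = new StringBuilder(); // tokens with no spaces
            StringBuilder phrase = new StringBuilder();  // tokens joined by spaces
            for (int end = start; end < tokens.size(); end++) {
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(tokens.get(end));
                letters.append(tokens.get(end));
                // ASSUMPTION: single letters are not counted as palindromes
                if (letters.length() >= 2 && isPalindrome(letters)) {
                    counts.merge(phrase.toString(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For n tokens, this sketch examines on the order of n^2 candidate phrases, and each check takes time proportional to the length of the phrase, so its running time is polynomial rather than linear in the size of the input. An analysis.txt for a different algorithm would follow the same style of reasoning.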

Submitting Your Assignment
Submit your assignment via Checkmate (checkmate.ics.uci.edu).

Evaluation Criteria
Your assignment will be graded on the following four criteria:


David G. Kay, kay@uci.edu
Friday, January 23, 2015 4:03 PM