Informatics 42 • Winter 2008 • David G. Kay • UC Irvine
Seventh Homework
Get your work checked and signed off by a classmate, then show it to your TA in lab. Try to get most of it done before Monday's lab, but feel free to take an extra couple of days if you need them.
For this week's homework, we're going to take the opportunity to go back and make some enhancements to the Stats program from the third homework, with an eye toward continuing to solidify your Java programming skills.
(a.1) Enhance your Stats program from the third homework to keep track of word (token) frequencies. Along with all the other statistics, it should print out the most frequently occurring token in its input file, along with the number of times that token occurred.
Think (and talk with your classmates) about how you'd implement this. There are some hints and details in the next paragraph, but don't read them until you've talked and thought about it. What data structures will you need? What's the outline of your approach?
You'll want to keep track of each token and how often it occurs. Use one of the collections in the Java library, using the token as the key and the number of times it occurs as the value. Once you've processed the text and counted the frequencies, you can go through the frequency table to find the highest value and the associated token.
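Here is a minimal sketch of that idea, assuming you choose a HashMap as the collection and that your tokens are already available as strings. The class name TokenFrequencies and its methods are made up for illustration; they aren't part of the Stats program from the third homework.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names here are illustrative only.
final class TokenFrequencies {
    /** Builds a table mapping each token to the number of times it occurs. */
    static Map<String, Integer> count(Iterable<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            // Start at 1 for a new token; otherwise add 1 to the stored count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the entry with the highest count, or null if the table is empty. */
    static Map.Entry<String, Integer> mostFrequent(Map<String, Integer> counts) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > best.getValue()) {
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "saw", "the", "dog");
        Map.Entry<String, Integer> top = mostFrequent(count(tokens));
        System.out.println(top.getKey() + " occurred " + top.getValue() + " times");
    }
}
```

If two tokens tie for the highest count, this sketch reports whichever one the map happens to visit first; decide for yourself how you want your program to handle ties.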
You may feel uncomfortable returning to a program you wrote a few weeks ago and then making modifications to it. Parts of it may no longer seem familiar to you, or you may not remember why you made the decisions you did when you were working on it. It's important to realize that, in a real-world software development context, you might often be in a position where you have to work on something that you haven't seen in a while, or where you have to work on something that you didn't even build in the first place. Unlike undergraduate homework assignments, which you can often just "submit and forget," real-world projects tend to have a long lifespan. If you're finding it difficult to get yourself back into the swing of working on this assignment, take the opportunity to consider what you might have done differently four weeks ago to make this experience better. Are there design choices you might have made differently? Documentation that you might have written? Names that you might have chosen differently?
(a.2) But what if you want the 10 most frequent tokens? You need to sort the frequency list. We haven't talked about how to do this the conventional way, with a sorting algorithm. But you could process your frequency table into a TreeMap, with the frequency count as the key and a list of the tokens having that frequency as the value. A perusal of the API for TreeMap will suggest some useful methods. Working on a copy of your solution from part (a.1), update your program to show the top 10 most frequent tokens, instead of just the single most-frequent one.
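One possible shape for the TreeMap approach is sketched below. It assumes you already have a token-to-count table like the one from part (a.1); the class name TopTokens and its methods are hypothetical. It relies on the TreeMap method descendingMap, which is one of several methods in the API that could serve here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names here are illustrative only.
final class TopTokens {
    /** Turns a token-to-count table into a count-to-tokens table. */
    static TreeMap<Integer, List<String>> byFrequency(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Create the list for this count the first time the count is seen.
            inverted.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                    .add(entry.getKey());
        }
        return inverted;
    }

    /** Returns up to 'limit' tokens, highest count first, formatted as "token (count)". */
    static List<String> top(Map<String, Integer> counts, int limit) {
        List<String> result = new ArrayList<>();
        // descendingMap() views the entries from the largest key to the smallest.
        for (Map.Entry<Integer, List<String>> entry : byFrequency(counts).descendingMap().entrySet()) {
            for (String token : entry.getValue()) {
                if (result.size() == limit) {
                    return result;
                }
                result.add(token + " (" + entry.getKey() + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("dog", 2);
        for (String line : top(counts, 10)) {
            System.out.println(line);
        }
    }
}
```

Tokens that share a count come out in no particular order in this sketch, and a tie at the tenth position is cut off arbitrarily.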
(a.3) Instead of choosing TreeMap in part (a.2), what would be the effect of choosing HashMap instead? Is it also an appropriate choice?
(a.4) (Optional additional challenge.) Even though we haven't talked about the details of how sorting algorithms work, the Java library includes at least one algorithm that you can use. Check out the documentation for a class called Collections for more details about Java's built-in sorting algorithms, then, working on a fresh copy of your solution to (a.1), use a sorting algorithm to solve the problem of showing the 10 most frequent tokens.
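As a rough sketch of one way to do this, the example below copies the entries of a token-to-count table into a list and sorts that list with Collections.sort and a Comparator. The class name SortedTokens and its methods are hypothetical, and breaking ties alphabetically is a choice made here for illustration, not a requirement of the assignment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names here are illustrative only.
final class SortedTokens {
    /** Returns the table's entries ordered by count, highest first. */
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                // Compare b to a so that larger counts sort first.
                int byCount = b.getValue().compareTo(a.getValue());
                // Assumption for this sketch: ties are broken alphabetically.
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }

    /** Returns the first 'limit' entries of the sorted list (fewer if the table is small). */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> sorted = sortedByCount(counts);
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("dog", 2);
        for (Map.Entry<String, Integer> entry : top(counts, 10)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```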
(b) Now, let's keep track of actual words, not just tokens. Modify your Stats program so that it continues to gather all the statistics about tokens as before, but also gathers a parallel set of statistics about "real words."
What's a "real" word? The file at http://www.ics.uci.edu/~kay/wordlist.txt contains about 380,000 words. For our purposes, a word is real if it's on this list. (We could think about ways of managing the list, allowing the user to add new words and delete questionable ones, but let's just think about it and not do it for now.) If you're working on a very old or limited-memory machine, use this 45,000-word file instead: http://www.ics.uci.edu/~kay/wordlist-short.txt
Implementation hints and advice (think before reading; that's how you learn): You'll need to read in the wordlist and store it in some collection structure; then as you process each token in the input text file, you'll look it up on the wordlist to determine whether to include it in the real-word statistics. There is a fine opportunity for code reuse here: If you have a class for a collection of tokens, with methods that produce various statistics, you can create one instance of the class for all the tokens in the input and a second instance that will contain just the real words.
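The sketch below shows one way the pieces might fit together. It makes several assumptions that the assignment doesn't state: that the wordlist has one word per line, that lookups should ignore case, and that a HashSet is the collection chosen. The classes TokenStats and RealWordFilter are hypothetical stand-ins for whatever classes your own Stats program uses, and the tiny in-memory wordlist stands in for the real file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// Hypothetical stand-in for your own statistics class.
final class TokenStats {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total;

    void add(String token) {
        counts.merge(token, 1, Integer::sum);
        total++;
    }

    int total() { return total; }
    int distinct() { return counts.size(); }
}

// Hypothetical helper class; the names here are illustrative only.
final class RealWordFilter {
    /** Reads a wordlist, assumed to hold one word per line, into a set. */
    static Set<String> loadWordList(Scanner in) {
        Set<String> words = new HashSet<>();
        while (in.hasNextLine()) {
            String word = in.nextLine().trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Adds every token to 'all', and only the tokens on the wordlist to 'real'. */
    static void tally(Iterable<String> tokens, Set<String> realWords,
                      TokenStats all, TokenStats real) {
        for (String token : tokens) {
            all.add(token);
            if (realWords.contains(token.toLowerCase(Locale.ROOT))) {
                real.add(token);
            }
        }
    }

    public static void main(String[] args) {
        // A three-word list standing in for the real wordlist file.
        Set<String> realWords = loadWordList(new Scanner("cat\ndog\nthe\n"));
        TokenStats allTokens = new TokenStats();
        TokenStats realOnly = new TokenStats();
        tally(Arrays.asList("the", "cat", "xyzzy", "the"), realWords, allTokens, realOnly);
        System.out.println("tokens: " + allTokens.total()
                + ", real words: " + realOnly.total());
    }
}
```

Because both sets of statistics come from the same class, any statistic you add to it later is available for all tokens and for real words alike.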
Optional additional challenge: For the word list, presumably you used one of the collection classes that has a fast search time. Substitute an (unordered) ArrayList or LinkedList in your program; then run it and see if there's a noticeable slowdown.
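If you'd like a quick way to measure the difference, a rough timing harness along these lines may help. It uses a made-up list of 20,000 synthetic words rather than the real wordlist, and the class name LookupTiming and its methods are hypothetical. Timings taken this way vary from run to run and machine to machine, so treat them as a rough indication only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper class; the names here are illustrative only.
final class LookupTiming {
    /** Counts how many of the tokens are found in the given word collection. */
    static int countHits(Collection<String> wordList, List<String> tokens) {
        int hits = 0;
        for (String token : tokens) {
            if (wordList.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    /** Runs the lookups once and prints the elapsed time in milliseconds. */
    static void report(String label, Collection<String> wordList, List<String> tokens) {
        long start = System.nanoTime();
        int hits = countHits(wordList, tokens);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + hits + " hits in " + elapsedMillis + " ms");
    }

    public static void main(String[] args) {
        // Synthetic data standing in for the wordlist and the input text.
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            words.add("word" + i);
        }
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            tokens.add("word" + (i * 7)); // some of these are not on the list
        }

        report("HashSet", new HashSet<>(words), tokens);
        report("ArrayList", new ArrayList<>(words), tokens);
        report("LinkedList", new LinkedList<>(words), tokens);
    }
}
```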
Written by David G. Kay, Winter 2005; modified Winter 2006.