ICS 23 / CSE 23 Summer 2012, Project #5: Lost for Words

Introduction

Hashing is an amazing technique. Compared to balanced binary search trees, which can be searched in O(log n) time, hash tables offer the tantalizing "holy grail" of constant-time searching of a large collection of data. Given this, it seems difficult to imagine why a balanced binary search tree would ever be the right option — why settle for logarithmic search times when you can achieve constant ones? However, hashing is not a strategy to be employed lightly. Done well, it can easily outperform most other data structures in terms of search time; done poorly, its performance can degenerate toward that of an unsorted linked list. (There's another reason, too, why you might prefer a balanced binary search tree: since they impose ordering on their keys, with smaller keys moving left and larger keys moving right, binary search trees can be iterated in ascending or descending order of their keys in linear time. Hash tables can't, since they impose no meaningful ordering on their keys; you instead have to copy all of the keys out and sort them.)

In this project, you'll build a hash table and see an example of hashing in action, giving you the ability to compare three techniques for hashing short strings. What you will discover is that the choice of hashing techniques is critical to the performance of a hash table. As we'll discuss in lecture, there's no general-purpose hashing strategy that will be optimal (or even good, necessarily) in every case: you have to know something about a particular set of data in order to effectively hash it.

An example of hashing in action: a spell checker

Many applications, including word processors, text editors, and email clients, include a spell-checking feature. A spell checker's job is relatively simple. Given a list of correctly-spelled words (which we'll call a wordlist) and the input text, the spell checker reports all of the misspellings (i.e., words that are not found in the wordlist) in the input. When a misspelling is found in the input, a good spell checker will also suggest words appearing in the wordlist that are somewhat like each of the misspelled words.

As an example, suppose that the wordlist contains the following words:

    bake cake main rain vase

If the input file contains the word vake, a good spell checker will state that vake is misspelled, since it does not appear in the wordlist, and will suggest that perhaps bake, cake, or vase was the word that was actually intended, since all three of these words are reasonably similar to the word vake, differing by only one letter each. (An even smarter spell checker might employ contextual clues to guess the word that is most appropriate in the sentence that surrounds it, but this is beyond the scope of our work here.)

Of course, a spell checker's task is centered around searching for words in a potentially large wordlist. An efficient search is critical to the performance of the spell checker, since it will be searching the wordlist not only for the words in the input, but possibly hundreds of suggestions that it might like to make for each misspelling. As you will see, a poor hashing technique can render a spell checker — or any system that requires a large number of insertions and lookups to be performed on a large hash table — effectively useless.

The program

The program you write for this project will be a rudimentary spell checker. It will read a wordlist from a file, then spell-check an input file. For each misspelling found in the input file, your program will report it, along with a sorted list of suggestions (if any).

To keep this task relatively simple in the limited time we have remaining this quarter, I have provided a large portion of the spell checker as either .java files or (in most cases) compiled code in a JAR. In fact, I will only be requiring you to implement two relatively small parts of it:

An implementation of a hash table that stores strings, using separate chaining (i.e., linked lists) to resolve collisions.
A class called WordChecker that checks the spelling of one word and makes an appropriate set of suggestions for a misspelled word.

Of course, it's required that you implement these classes according to the template provided, so that your code will work correctly with ours. Once you've completed the program, I'd like you to compare the performance of three provided hash functions (and you can implement your own and try them, as well) to get a feel for the dramatic influence that hash functions have on the performance of hash tables.

How is a hash table used in a spell checker?

The most common application of a hash table is as an implementation of an abstract data structure called a map. Recall from earlier this quarter that a map is a set of associations between keys and values, where the keys uniquely identify the values. In a previous project, you built a skip list implementation of a map. A hash table can also be used to implement a map, where the keys are hashed and placed into the table, along with their associated values.

However, hash tables (as well as binary search trees and skip lists) are also a handy way to efficiently implement another abstract data structure called a set. A set is a collection of unique data items. (One way to think about it is as a map with keys but no values.) The simplest use of a set requires three operations:

add(X): Adds an item X to the set.
lookup(X): Returns true if the item X is in the set and false otherwise.
remove(X): Removes the item X from the set.

In a spell checker, the limiting factor on performance is the implementation of the wordlist. A wordlist is, abstractly, a set of strings. The important operations in a spell checker are adding a word to the wordlist (useful for initially building up the wordlist as it's read from an input source) and looking up a word. A hash table is a natural choice as an implementation for a set with these operations, provided that it's possible to come up with a suitable hash function for the type of data being stored in the set. (In our case, the data is a set of short strings. I've provided three hash functions, two of which are (intentionally) poor, and a third that is significantly better than the other two. I'll discuss that a bit later in the write-up.)

Generating suggestions for misspelled words

There are two popular text-mode spell checkers on Unix/Linux systems. One is called ispell; the other is a GNU "free software" program called aspell. They both use similar techniques for generating suggestions for misspelled words. While checking the spelling of an input file, if a word is not found in the wordlist, five techniques are used to generate possible suggestions. Each suggestion is looked up in the wordlist; any suggestion found in the wordlist is added to the list of suggestions. The five techniques used are:

Swapping each adjacent pair of characters in the word.
In between each adjacent pair of characters in the word (also before the first character and after the last character), each letter from 'A' through 'Z' is inserted.
Deleting each character from the word.
Replacing each character in the word with each letter from 'A' through 'Z'.
Splitting the word into a pair of words by adding a space in between each adjacent pair of characters in the word. It should be noted that this will only generate a suggestion if both words in the pair are found in the wordlist.

I'd like your WordChecker class to generate suggestions using these five techniques, as well. (It should be noted that there are other ways to generate suggestions, including using algorithms that pay attention not only to the letters, but what the letters actually sound like. One such well-known algorithm is called the Soundex algorithm. You are not required to implement such algorithms for this project.)

How the program will work when it is completed

The main() method for this program appears in a class called SpellCheck. The program takes a sequence of command-line parameters that configure its behavior. Documentation about how to use them can be seen by compiling the program and running the following command:

    java SpellCheck

or by running the program in Eclipse without command-line arguments. (Note that you won't be able to compile the program as provided, because there are a few methods that are required to return values, but don't have any code in them.)

In brief, the command-line parameters allow you to pick a hashing strategy, specify a wordlist file, turn the program's output off (for timing testing), and specify the input text file.

When the program's output is turned on, it reads the wordlist into a hash table, then begins reading from the input text file one word at a time. Each word is looked up in the wordlist. If it is not found, the program writes the entire line of the input file that contained that word, along with an indication of which word was misspelled, and a sorted list of suggestions. For example, this input file:

    This is a lne of text that has a missspelling in it.

generates the following output using the default wordlist that I provided:

    This is a lne of text that has a missspelling in it.
         word not found: LNE
      perhaps you meant: 
              LANE 
              LEE 
              LIE 
              LINE 
              LONE 
              ONE 

    This is a lne of text that has a missspelling in it.
         word not found: MISSSPELLING
      perhaps you meant: 
              MISS SPELLING 
              MISSPELLING

With this output, you'll be able to test whether your classes are behaving correctly. However, this will not necessarily give you a sense of how the different hash functions perform in comparison to one another. To accomplish that, you'll need to run timing tests.

Comparing the performance of the provided hash functions

The -quiet command-line parameter switches off the program's output. The program does all the same work it normally does (reading the wordlist and input file, looking up words, generating suggestsions), but writes no output to the console. It then reports the total amount of time (in milliseconds) spent doing the work.

It should be noted that the resolution of the timer on most desktop operating systems is not precise enough to get a good timing result unless the total time is at least 100 or so milliseconds. In general, longer times are more precise than shorter ones. So, I've provided one large test case in a file called big-test.txt. Run this input file against the provided wordlist using each of the provided hash functions. You should see that the "better" hash algorithm outperforms the "lousy" one by a substantial margin, while the "degenerate" one essentially never completes (unless you have quite a bit of patience). I've provided the Java code for each of these hash functions, so you can experiment with them if you'd like. There's plenty of information on the Web about hashing strings, if you're so inclined.

In particular, take note of the fact that the "better" and "lousy" hash functions contain code that is pretty similar, yet their behavior is dramatically different. This should help you to see how carefully you should approach hashing: there's often a fine line between doing it well and doing it badly, and doing it badly is much worse than using a "predictable" data structure like a balanced binary search tree or skip list. But doing it well has tremendous benefits.

Starting point

As stated earlier in the write-up, I'm providing you with a nearly-complete implementation of the program, with just two classes — HashTable and WordChecker — left unimplemented. The code is available in a Zip archive. (The archive is a bit large, because it includes a wordlist of over 60,000 words, along with a large test case that you can use to compare the provided hash functions.)

It should be pointed out that I didn't create the provided wordlist myself. I got it, along with the large test case in big-test.txt, from an open-source wordlist project that is hosted at wordlist.sourceforge.net. If you're interested in building a freeware or open-source application that requires a large list of English words, this is a great place to get a wordlist for it.