ICS 23 / CSE 23 - Project #3: Lost for Words

Due date and time: Friday, November 7, 6:59pm


Introduction

Hashing is an amazing technique. Compared to balanced binary search trees, which can be searched in O(log n) time, hash tables offer the tantalizing "holy grail" of constant-time searching of a large collection of data. Given this, is seems difficult to imagine why a balanced binary search tree would ever be the right option -- why settle for logarithmic search times when you can achieve constant ones? However, hashing is not a strategy to be employed lightly. Done well, it can easily outperform most other data structures in terms of search time; done poorly, its performance can degenerate to that of a linked list.

In this project, you'll build a hash table and see an example of hashing in action, giving you the ability to compare three techniques for hashing short strings. What you will discover is that the choice of hashing techniques is critical to the performance of a hash table. As we'll discuss in lecture, there's no general-purpose hashing strategy that will be optimal (or even good, necessarily) in every case: you have to know something about a particular set of data in order to effectively hash it.


An example of hashing in action: a spell-checker

Many applications, including word processors, text editors, and email clients, include a spell-checking feature. A spell checker's job is relatively simple. Given a list of correctly-spelled words (which we'll call a wordlist) and the input text, the spell checker reports all of the misspellings (i.e. words that are not found in the wordlist) in the input. When a misspelling is found in the input, a good spell checker will also suggest words appearing in the wordlist that are somewhat like each of the misspelled words.

As an example, suppose that the wordlist contains the following words:

    bake cake main rain vase

If the input file contains the word vake, a good spell checker will state that vake is misspelled, since it does not appear in the wordlist, and will suggest that perhaps bake, cake, or vase was the word that was actually intended.

Of course, a spell checker's task is centered around searching for words in a potentially large wordlist. An efficient search is critical to the performance of the spell checker, since it will be searching the wordlist not only for the words in the input, but possibly hundreds of suggestions that it might like to make for each misspelling. As you will see, a poor hashing technique can render a spell checker effectively useless.


The program

The program you write for this project will be a rudimentary spell checker. It will read a wordlist from a file, then spell-check an input file. For each misspelling found in the input file, your program will report it, along with a sorted list of suggestions (if any).

To keep this task relatively simple in the limited time we have remaining this quarter, I have provided a large portion of the spell checker as either .java files or (in most cases) compiled .class files. In fact, I will only be requiring you to implement two relatively small parts of it:

Of course, it's required that you implement these classes according to the template provided, so that your code will work correctly with ours. Once you've completed the program, I'd like you to compare the performance of three provided hash functions (and you can implement your own and try them, as well) to get a feel for the dramatic influence that hash functions have on the performance of hash tables.


How is a hash table used in a spell checker?

The most common application of a hash table is as an implementation of an abstract data structure called a map. Recall from earlier this quarter that a map is a set of associations between keys and values, where the keys uniquely identify the values. In a previous project, you built a splay tree implementation of a map. A hash table can also be used to implement a map, where the keys are hashed and placed into the table, along with their associated values.

However, hash tables are also a handy way to efficiently implement another abstract data structure called a set. A set is a collection of unique data items. The simplest use of a set requires three operations:

In a spell checker, the limiting factor on performance is the implementation of the wordlist. A wordlist is, abstractly, a set of strings. The important operations in a spell checker are adding a word to the wordlist (useful for initially building up the wordlist as it's read from an input source) and looking up a word. A hash table is a natural choice as an implementation for a set with these operations, provided that it's possible to come up with a suitable hash function for the type of data being stored in the set. (I've provided three hash functions, two of which are (intentionally) poor, and a third that is significantly better than the other two. I'll discuss that a bit later in the write-up.)


Generating suggestions for misspelled words

There are two popular text-mode spell checkers on Unix/Linux systems. One is called ispell; the other is a GNU "free software" program called aspell. They both use similar techniques for generating suggestions for misspelled words. While checking the spelling of an input file, if a word is not found in the wordlist, five techniques are used to generate possible suggestions. Each suggestion is looked up in the wordlist; any suggestion found in the wordlist is added to the list of suggestions. The five techniques used are:

I'd like your WordChecker class to generate suggestions using these five techniques, as well. (It should be noted that there are other ways to generate suggestions, including using algorithms that pay attention not only to the letters, but what the letters actually sound like. One such well-known algorithm is called the Soundex algorithm. You are not required to implement such algorithms for this project.)


How the program will work when it is completed

The main( ) method for this program appears in a class called SpellCheck. The program takes a set of command-line parameters. Documentation about how to use them can be seen by compiling the program and running the following command:

    java SpellCheck

In brief, the command-line parameters allow you to pick a hashing strategy, specify a wordlist file, turn the program's output off (for timing testing), and specify the input text file.

When the program's output is turned on, it reads the wordlist into a hash table, then begins reading from the input text file one word at a time. Each word is looked up in the wordlist. If it is not found, the program writes the entire line of the input file that contained that word, along with an indication of which word was misspelled, and a sorted list of suggestions. For example, this input file:

This is a lne of text that has a missspelling in it.

generates the following output using the default wordlist that I provided:

This is a lne of text that has a missspelling in it.
     word not found: LNE
  perhaps you meant: 
          LANE 
          LEE 
          LIE 
          LINE 
          LONE 
          ONE 

This is a lne of text that has a missspelling in it.
     word not found: MISSSPELLING
  perhaps you meant: 
          MISS SPELLING 
          MISSPELLING 

With this output, you'll be able to test whether your classes are behaving correctly. However, this will not necessarily give you a sense of how the different hash functions perform in comparison to one another.


Comparing the performance of the provided hash functions

The -quiet command-line parameter switches off the program's output. Instead, the program does all the same work, writes no output to the console, then reports the total amount of time (in milliseconds) spent doing the work.

It should be noted that the resolution of the timer on most desktop operating systems is not precise enough to get a good timing result unless the total time is at least 100 or so milliseconds. In general, longer times are more precise than shorter ones. So, I've provided one large test case in a file called big-test.txt. Run this input file against the provided wordlist using each of the provided hash functions. You should see that the "better" hash algorithm outperforms the "lousy" one by a substantial margin, while the "degenerate" one essentially never completes (unless you have quite a bit of patience). I've provided the Java code for each of these hash functions, so you can experiment with them if you'd like. There's plenty of information on the Web about hashing strings, if you're so inclined.


Starting point

As stated earlier in the write-up, I'm providing you with a nearly-complete implementation of the program, with just two classes -- HashTable and WordChecker -- left unimplemented. The code is available in a Zip archive. (The archive is a bit large, because it includes a wordlist of over 60,000 words, along with a large test case that you can use to compare the provided hash functions.)

It should be pointed out that I didn't create the provided wordlist myself. I got it, along with the large test case in big-test.txt, from an open-source wordlist project that is hosted at wordlist.sourceforge.net. If you're interested in building a freeware or open-source application that requires a large list of English words, this would be a great place to get a wordlist for it.


Deliverables

Submit your HashTable.java and WordChecker.java files, along with the .java files containing any additional classes that you wrote (if any). You need not submit any of the other code that we provided.

Follow this link for an explanation of how to turn in your project.


Limitations

Except for java.util.ArrayList, you may not use any of the collection classes in the java.util library (e.g. LinkedList, TreeMap, HashSet). Remember, as always, you are to implement your own data structures.