Searching for a Better Way

Assignment 4

Searching through data is a frequent necessity in data processing; finding what you want quickly is typically the goal. Yet, even with a fast machine, a major determiner of speedy data retrieval is the algorithm employed. This project will give you practice in using two searching techniques, ChainSearch and skip lists, and helps give you a feel for the practical meaning of O-notation. You’ll practice code optimization, a commonly used (and abused) programming activity. This lab also demonstrates how programs can be used as test vehicles to answer design questions for larger programs. You also get practice in preparing reports based upon program-generated data.


Background

Imagine you’ve been hired as a “data jockey” by a national chain of retail stores, due to open shortly. Your boss briefs you on your first assignment:

Marketing has (what they think) is a brilliant approach to notifying people about these new stores: In addition to standard print advertising, telemarketers are going to telephone homes within a ten-mile radius of each new store to personally invite consumers to “come on in” during the grand openings.

The marketing research branch has prepared a list of the 5-digit ZIP codes which correspond to these invitation areas; it has obtained a disk file that lists, for every phone group, a corresponding ZIP code. A phone group consists of a three-digit area code followed by a three-digit phone prefix (the first three digits of the phone number). Since one phone group might be found in several ZIP code areas, there may be several entries in the file that have the same phone group, but have different ZIPs. And clearly there could be ZIP areas that have several different phone groups in them.) They call this the master file. Marketing has also procured a list of telephone customers' names and phone numbers, in phone number order; this is the phone list.

Marketing wants a look-up program that will print on the screen, in an easy-to-read format, the phone groups that are in a chosen ZIP code area. A telemarketing coordinator will use this program to assign a ZIP's phone groups to telemarketers, who will then use the phone list to make calls to potential customers (who are not on the Federal Do Not Call List). Always time-pressured, marketing wants a request for a ZIP area’s phone groups filled as quickly as possible.

This is not a one-time program. As calls to each ZIP code area’s list are completed, another set of numbers will be needed; as new stores open, marketing will need additional ZIP code areas’ phone groups.

Some requirements and constraints are known at this time:


Your assignment

Your boss tells you that your task, perhaps surprisingly, is not to write this program; rather, it is to investigate and report on which data structure ought to be used to make the search for a ZIP code, and returning a list of the phone groups associated with it, as fast as feasible. The program requirements team will use your results in determining the requirements for the search algortihm they give to the design and programming teams. The boss also admonishes you that time (as always) is of the essence, so you are to focus your efforts only on comparing skip list searching and ChainSearch; in particular, you are not to go off looking at other search approaches!

You are to report your results in this company-standard format:

  1. Title Page
  2. Table of contents. Page numbers can be any form you like as long as a reader can get to a page quickly.
  3. Problem Statement. A discussion of what you are investigating, and the results that you hope to achieve.
  4. Method. What you tried when working toward a solution to the problem.
  5. Results. What you discovered, completely and clearly presented.
  6. Conclusions and Recommendations. What you conclude from your investigation, and what you recommend to be done.
  7. Appendices. Any supportive material, too detailed or technical to be included in the main body of the text (such as references to programs and their output).

Requirements, hints, and suggestions

Requirements given in this section are stated explicitly (by such phrases as “We require you to...”). The hints and suggestions here are just that; they are not requirements. Within the requirements, you may do and name things differently, even use a completely different approach.

  1. We require you to code up classes for the skip list and chain list. We have provided skeletons for SkipList and ChainList classes to get you started, in the Searching for a Better Way zip file; you may use them to any extent you wish, including not at all! Aim for speedy algorithms, but be sure to use good style. In particular, avoid using Java features that aren't really needed to get the job done cleanly and quickly, and may unnecessarily add to the running time or memory usage. (For instance, we did not use the Java LinkedList class for the chains of the chain list; LinkedLists are doubly-linked and we only need singly-linked lists. Using our own singly-linked list saves the space used by the “previous” pointers of a LinkedList, thereby dropping the memory requirements of a chain by about one-third—a significant savings.)

    Also in the archive is a sample master file, TestZIP.txt, of about 97,000 lines. We encourage you to use it, and portions of it, as appropriate, when testing and timing your routines.

    With the exception of ArrayList, you are not allowed to use the predefined Java “collection” classes, such as java.util.TreeMap, in your solution. (The collection classes are the ones that store a collection of data, and include such classes as LinkedList, HashMap, Vector, Hashtable, and TreeMap.).

  2. You need methods to add the data from the master file to the skip list or chain list and to search those structures. Write four methods to accomplish this task; call them SkipList.buildList() and SkipList.lookup() (to build and search the skip list) and ChainList.buildList() and ChainList.lookup() (to build and search the chain list). Signatures for these methods are provided in the skeleton classes we’ve provided.

  3. To test searching using the skip list, write a method called testSkipSearch that takes as parameters the name of the master file, the value of p to be used, and the name of a text file that has a list of ZIP codes for which to search. (The format of the latter file is up to you; a text file with each zip code on one line is one good, straightforward approach.) testSkipSearch first loads up the skip list with the master file’s data and reports how long that took. Then, it records the time it takes to search for all the ZIP codes (including the time it takes to return its list of phone groups or throw an exception if the ZIP code is not found) and divides by the number of searches done; it then reports this average search time and exits.

  4. To test ChainSearch, write a method called testChainSearch that takes as parameters the name of the master file and the name of a text file that has a list of ZIP codes for which to search. (Make this file have the same format as the one used with testSkipSearch). testChainSearch builds the chain list using the master file data and reports how long that took. Then, as with testSkipSearch, testChainSearch records the time it takes to search for all the ZIP codes (including the time it takes to return its list of phone groups or throw an exception if the ZIP code is not found) and divides by the number of searches done; it then reports this average search time and exits.

    When testing chain searching, you'll need to try out several different map key functions. The easiest way to do this is to extend from the provided abstract class AbstractMapKey a concrete class that represents a map key; do this for each map key you want to try. Then, in your test program, you construct the map key(s) you want to try and pass them in turn to the ChainList constructor. Details on this approach are given in comments in the AbstractMapKey class.

  5. We require that you write some code, a driver program, that causes your testing methods to be called with various sizes of master files, with master files containing various distributions of keys, with several p values or map key methods (as appropriate) and with various sizes of, and distributions of, ZIP codes in the file that lists the ZIPs for which to search. By having the driver call your test routines repeatedly with well-chosen combinations of the choices given above, you can have it produce a file containing all of your results—and if you format the file well, its data can be pasted directly into a spreadsheet for further analysis. This approach will save you significant time over entering choices via the keyboard, one set at a time, and then copying the results from the screen by hand into a spreadsheet of your report.

    By choosing your test cases so they illuminate distinct behaviors, rather than just repeating a look at the same sort of behavior, you can save yourself enumerable tests and lots of work. For example, the number of items in the skip or chain list, which is determined by the number of lines in the master file, could easily affect the performance of the search. One good set of sizes for the master file is {100, 500, 1000, 10,000, 100,000, 250,000} records. These sizes more than cover the sizes your boss said were possible—they allow a safety margin: if the specs change somewhat, our code should still work. We require that y ou test for code with master files of the aforelisted sizes. Include whatever additional sizes you think are needed to gauge the relative performance of skip list searching and ChainSearch. If your program runs out of memory for lists beyond a certain size, be sure to report that finding!

    As for the searches that will be done during the day, three scenarios come to mind (you might think of others). The first is that a telemarketer will focus on one store opening, then another and then another, and so one. That would imply that the ZIP codes used in the search would be in groups that are numerically near each other, as ZIP codes are roughly assigned in ascending order as one moves from east to west across the U.S. So run tests where a group of the ZIP codes are numerically close to each other, then the following group are close to each other, but not to the first group, and so on.

    Another scenario is where something comes up and a search of one ZIP code is needed, then nothing happens for a while, then another ZIP code is checked, and so on. This would lead to a test file where there are a number of ZIP codes where each one may or may not be numerically close to the one preceding it. Run a test using such a group of codes.

    The third situation is a mixture of the first two: the coordinator looks up a bunch of ZIPs for a store, then one ZIP for query that has come up, then perhaps another ZIP code, then perhaps another store's group, and so on. Run at least one test with a file of ZIP codes for which to search that represents this situation.

    For the choice of p, use 0.25, 0.33, 0.50, 0.67, and 0.90. You can also use other values as you find appropriate to making searching via (and building the) skip list as fast as feasible.

    For the map key, try at least these two approaches: (1) mapkey(key) = key (that is, use the ZIP code) and (2) mapkey(k) = int(key/100) (that is, the first three digits of the zip code). Again, try others you think appropriate to produce as fast an approach as feasible.

  6. Run the test program with good test data; analyze the results; write the required report. The report should be as clear and well written as you can make it. We encourage you to use the features of Microsoft Office or other productivity tools to help you with your presentation. You must include tables of times used by various approaches, along with other appropriate data and discussion, to support your conclusions; we also strongly recommend that you present your results graphically. In the appendix, include your all the code and output on which your report relies, or give the names of the files in which they can be found; of course, turn those files in with your report.

  7. Finally: Don’t get so wrapped up with the experimentation that you leave no time to write your report!


Timing the sort routine

Java has a few different library classes that can help you time how long it takes to do a particular search. Using the System.currentTimeMillis( ) method seems to be the easiest way. It returns the current date and time, measured to the nearest millisecond, and reported as the number of elapsed milliseconds since midnight, January 1, 1970 (sometimes referred to as the “Unix epoch,” since this is how Unix-based systems have traditionally measured time). To compute the amount of time a task takes, record the time before and after the task, then subtract the difference. When the task begins, do this:

long startTime = System.currentTimeMillis();

...and when it ends, do this:

long endTime = System.currentTimeMillis();

The elapsed time is simply the difference between these two times.

Remember the following:


Deliverables

You’ll need to turn in

  1. A Microsoft Word document containing the report of your findings and conclusions, in the format and containing the information described above.

  2. A copy of the Java programs that you used to create the results presented in your report, and their output, if such are not included directly in your report.


Grading

Your grade for this project is based on your program’s correctness and style and on the report’s quality, including how well its recommendations are supported, its composition and its presentation. And the higher weight is on the report: The programs are just an investigative tool; what matters most is the report conclusions and how well those conclusions are supported.


Optional work


Written by Norman Jacobson for ICS23 Spring 2006, March 2006.
Minor revisions to clarify passages, by Norman Jacobson, November 2006.
  Inspiried by the March 1997 version of the ICS23 lab exercise
    Searching for a Better Way (that used B-trees, hashing and C++)
    written by Norman Jacobson in consultation with George Lueker,
    Spring 1988.
  Timing and reporting prose adapted from the September 2004
    version of the 23 exercise When to Be Quick, When to be Simple,
    written by Norman Jacobson and George Lueker, with added inspiration
    from Thomas A. Standish, Spring 1988.
  Discussion of skip lists includes prose from the 23 exercise Project #3:
    Always Changing Probably
by Alex Thornton, Summer 2005 version.
  SkipList Java class skeleton written by Alex Thornton, July 2005.
  ChainList Java class skeleton written by Norman Jacobson, based on
    Thornton’s SkipList class, March 2006.
Minor revision to introduce the term “phone group” to distinguish a prefix
  from an area code + prefix group, by Norman Jacobson, November 2006.
Minor edits for clarity for ICS23 Winter 2007 by Norman Jacobson, January 2007.