Searching through data is a frequent necessity in data processing; finding what you want quickly is typically the goal. Yet, even with a fast machine, a major determiner of speedy data retrieval is the algorithm employed. This project will give you practice in using two searching techniques, ChainSearch and skip lists, and helps give you a feel for the practical meaning of O-notation. Youll practice code optimization, a commonly used (and abused) programming activity. This lab also demonstrates how programs can be used as test vehicles to answer design questions for larger programs. You also get practice in preparing reports based upon program-generated data.
Imagine youve been hired as a data jockey by a national chain of retail stores, due to open shortly. Your boss briefs you on your first assignment:
Marketing has (what they think) is a brilliant approach to notifying people about these new stores: In addition to standard print advertising, telemarketers are going to telephone homes within a ten-mile radius of each new store to personally invite consumers to come on in during the grand openings.
The marketing research branch has prepared a list of the 5-digit ZIP codes which correspond to these invitation areas; it has obtained a disk file that lists, for every phone group, a corresponding ZIP code. A phone group consists of a three-digit area code followed by a three-digit phone prefix (the first three digits of the phone number). Since one phone group might be found in several ZIP code areas, there may be several entries in the file that have the same phone group, but have different ZIPs. And clearly there could be ZIP areas that have several different phone groups in them.) They call this the master file. Marketing has also procured a list of telephone customers' names and phone numbers, in phone number order; this is the phone list.
Marketing wants a look-up program that will print on the screen, in an easy-to-read format, the phone groups that are in a chosen ZIP code area. A telemarketing coordinator will use this program to assign a ZIP's phone groups to telemarketers, who will then use the phone list to make calls to potential customers (who are not on the Federal Do Not Call List). Always time-pressured, marketing wants a request for a ZIP areas phone groups filled as quickly as possible.
This is not a one-time program. As calls to each ZIP code areas list are completed, another set of numbers will be needed; as new stores open, marketing will need additional ZIP code areas phone groups.
Some requirements and constraints are known at this time:
Prior study has narrowed down the choices for the structures to a skip list or a chain listso those are the ones you are to investigate.
Skip lists allow for fast searches, with trade-offs between speed and storage made by how one determines p, the probability that a value appears in the next higher layer, as the running time of skip list is O(log1/p n / p). On average, each element appears in 1/(1-p) lists; that implies that, as p gets larger, so does the number of items stored in the (entire) skip list and the longer it takes to load up the skip list from the master file. For skip lists to be practical for this telemarketing effort, a reasonable value of p must be determined.
ChainSearch is even faster than using skip lists, O(1), but only if the map key is chosen so the chains are roughly of the same length and reasonably short. Adding a new item to a chain is also typically O(1), but again only if the chains are reasonably short. ChainSearch can degrade to O(n) performance, both when searching and inserting items, if the map key results in one or a few very long chains instead of a bunch of short ones. So choosing a good map key is crucial.
The program will be started up at the start of the day, at which time the master file's information is placed into the skip list or chain list; searches are then made using this memory structure. The program remains running all day, and is shut down at the end of the workday. Of course, it is possible, because of a system crash or other malfunction, that the program may need to be shut down or restarted during the day.
The master file will be a sequential text file; each physical line is
3-digit-area-code space(s) 3-digit-prefix space(s) up-to-5-digit-ZIP-code end-of-line-mark
Area codes and prefixes never start with 0, so they are always three digits. ZIP codes can start with leading zeroes. In the master file, leading zeroes are not included in the ZIP code, because some languages, like Java, take a leading zero to mean the integer is octal (base 8), rather than base 10...so reading in a ZIP of, say, 00010, will be interpreted as an 8! This is not a problem, as since we know ZIPs are really five digits, the right number of leading zeroes (if any are needed) can be included when printing them.
You can assume the master file always has the correct format.
The master file will always have at least a few hundred lines in it, and may have up to hundreds of thousands of lines.
Your boss tells you that your task, perhaps surprisingly, is not to write this program; rather, it is to investigate and report on which data structure ought to be used to make the search for a ZIP code, and returning a list of the phone groups associated with it, as fast as feasible. The program requirements team will use your results in determining the requirements for the search algortihm they give to the design and programming teams. The boss also admonishes you that time (as always) is of the essence, so you are to focus your efforts only on comparing skip list searching and ChainSearch; in particular, you are not to go off looking at other search approaches!
You are to report your results in this company-standard format:
Requirements given in this section are stated explicitly (by such phrases as We require you to...). The hints and suggestions here are just that; they are not requirements. Within the requirements, you may do and name things differently, even use a completely different approach.
We require you to code up classes for the skip list and chain list. We have provided skeletons for SkipList and ChainList classes to get you started, in the Searching for a Better Way zip file; you may use them to any extent you wish, including not at all! Aim for speedy algorithms, but be sure to use good style. In particular, avoid using Java features that aren't really needed to get the job done cleanly and quickly, and may unnecessarily add to the running time or memory usage. (For instance, we did not use the Java LinkedList class for the chains of the chain list; LinkedLists are doubly-linked and we only need singly-linked lists. Using our own singly-linked list saves the space used by the previous pointers of a LinkedList, thereby dropping the memory requirements of a chain by about one-thirda significant savings.)
Also in the archive is a sample master file, TestZIP.txt, of about 97,000 lines. We encourage you to use it, and portions of it, as appropriate, when testing and timing your routines.
With the exception of ArrayList, you are not allowed to use the predefined Java collection classes, such as java.util.TreeMap, in your solution. (The collection classes are the ones that store a collection of data, and include such classes as LinkedList, HashMap, Vector, Hashtable, and TreeMap.).
You need methods to add the data from the master file to the skip list or chain list and to search those structures. Write four methods to accomplish this task; call them SkipList.buildList() and SkipList.lookup() (to build and search the skip list) and ChainList.buildList() and ChainList.lookup() (to build and search the chain list). Signatures for these methods are provided in the skeleton classes weve provided.
To test searching using the skip list, write a method called testSkipSearch that takes as parameters the name of the master file, the value of p to be used, and the name of a text file that has a list of ZIP codes for which to search. (The format of the latter file is up to you; a text file with each zip code on one line is one good, straightforward approach.) testSkipSearch first loads up the skip list with the master files data and reports how long that took. Then, it records the time it takes to search for all the ZIP codes (including the time it takes to return its list of phone groups or throw an exception if the ZIP code is not found) and divides by the number of searches done; it then reports this average search time and exits.
To test ChainSearch, write a method called testChainSearch that takes as parameters the name of the master file and the name of a text file that has a list of ZIP codes for which to search. (Make this file have the same format as the one used with testSkipSearch). testChainSearch builds the chain list using the master file data and reports how long that took. Then, as with testSkipSearch, testChainSearch records the time it takes to search for all the ZIP codes (including the time it takes to return its list of phone groups or throw an exception if the ZIP code is not found) and divides by the number of searches done; it then reports this average search time and exits.
When testing chain searching, you'll need to try out several different map key functions. The easiest way to do this is to extend from the provided abstract class AbstractMapKey a concrete class that represents a map key; do this for each map key you want to try. Then, in your test program, you construct the map key(s) you want to try and pass them in turn to the ChainList constructor. Details on this approach are given in comments in the AbstractMapKey class.
We require that you write some code, a driver program, that causes your testing methods to be called with various sizes of master files, with master files containing various distributions of keys, with several p values or map key methods (as appropriate) and with various sizes of, and distributions of, ZIP codes in the file that lists the ZIPs for which to search. By having the driver call your test routines repeatedly with well-chosen combinations of the choices given above, you can have it produce a file containing all of your resultsand if you format the file well, its data can be pasted directly into a spreadsheet for further analysis. This approach will save you significant time over entering choices via the keyboard, one set at a time, and then copying the results from the screen by hand into a spreadsheet of your report.
By choosing your test cases so they illuminate distinct behaviors, rather than just repeating a look at the same sort of behavior, you can save yourself enumerable tests and lots of work. For example, the number of items in the skip or chain list, which is determined by the number of lines in the master file, could easily affect the performance of the search. One good set of sizes for the master file is {100, 500, 1000, 10,000, 100,000, 250,000} records. These sizes more than cover the sizes your boss said were possiblethey allow a safety margin: if the specs change somewhat, our code should still work. We require that y ou test for code with master files of the aforelisted sizes. Include whatever additional sizes you think are needed to gauge the relative performance of skip list searching and ChainSearch. If your program runs out of memory for lists beyond a certain size, be sure to report that finding!
As for the searches that will be done during the day, three scenarios come to mind (you might think of others). The first is that a telemarketer will focus on one store opening, then another and then another, and so one. That would imply that the ZIP codes used in the search would be in groups that are numerically near each other, as ZIP codes are roughly assigned in ascending order as one moves from east to west across the U.S. So run tests where a group of the ZIP codes are numerically close to each other, then the following group are close to each other, but not to the first group, and so on.
Another scenario is where something comes up and a search of one ZIP code is needed, then nothing happens for a while, then another ZIP code is checked, and so on. This would lead to a test file where there are a number of ZIP codes where each one may or may not be numerically close to the one preceding it. Run a test using such a group of codes.
The third situation is a mixture of the first two: the coordinator looks up a bunch of ZIPs for a store, then one ZIP for query that has come up, then perhaps another ZIP code, then perhaps another store's group, and so on. Run at least one test with a file of ZIP codes for which to search that represents this situation.
For the choice of p, use 0.25, 0.33, 0.50, 0.67, and 0.90. You can also use other values as you find appropriate to making searching via (and building the) skip list as fast as feasible.
For the map key, try at least these two approaches: (1) mapkey(key) = key (that is, use the ZIP code) and (2) mapkey(k) = int(key/100) (that is, the first three digits of the zip code). Again, try others you think appropriate to produce as fast an approach as feasible.
Run the test program with good test data; analyze the results; write the required report. The report should be as clear and well written as you can make it. We encourage you to use the features of Microsoft Office or other productivity tools to help you with your presentation. You must include tables of times used by various approaches, along with other appropriate data and discussion, to support your conclusions; we also strongly recommend that you present your results graphically. In the appendix, include your all the code and output on which your report relies, or give the names of the files in which they can be found; of course, turn those files in with your report.
Finally: Dont get so wrapped up with the experimentation that you leave no time to write your report!
Java has a few different library classes that can help you time how long it takes to do a particular search. Using the System.currentTimeMillis( ) method seems to be the easiest way. It returns the current date and time, measured to the nearest millisecond, and reported as the number of elapsed milliseconds since midnight, January 1, 1970 (sometimes referred to as the Unix epoch, since this is how Unix-based systems have traditionally measured time). To compute the amount of time a task takes, record the time before and after the task, then subtract the difference. When the task begins, do this:
long startTime = System.currentTimeMillis();
...and when it ends, do this:
long endTime = System.currentTimeMillis();
The elapsed time is simply the difference between these two times.
Remember the following:
Before you do any timings, be very sure your build routines work properly and your search routines actually search correctly! Timings of incorrect algorithms are useless at best, and potentially very damagingimagine what could occur if your company put into production the worse of the two searches, or a version of a search that didnt work at all, because of the errors you made in your code or your analysis. In particular, first try out your routines on a very small master file and a small list of searches; the programs run quickly and the structures are small and easy to check compared to a test run with, say, a thousand-record master files and 100 searches. Once your routines work on these small files, then you can test them using larger ones.
Only count the actual time it takes to build the list and the time to search itdont include set-up overhead or the time to write out the timing results. An easy way to remove overhead is to time the build or search routine, then compute the time again, but call a fake routine that passes in the same parameters but then immediately returns, without doing any actual work. Subtract the time this routine took from the time for your real routine, and you are left with just the actual build or search time.
For a small number of searches, the time needed to complete a search should be much less than one millisecond! Further, your program shares resources with other programs, and so is not always immediately given access to the system clock when it requests the current time. As a result, the elapsed time can be somewhat greater than the actual time taken: it might take a few milliseconds from when your sort ends until your program gets access to the system clock and records the time. For long algorithms, this choppy time measurement doesnt matter, at most adding a few milliseconds to a large time. But for fast algorithms, it can look as if a task is taking much longer than it really is; a millisecond is a lot of time for a modern computer.
To time fast algorithms more accurately, run the algorithm many times (by placing it inside a loop) and then divide the time it took by the number of times the algorithm was executed. The loop has to be repeated many times to get accurate measurements the shorter the time, the more measurements required.
So, to compute an accurate time for a fast search, the code would look something like this:
double searchTime = computeSearchTime();
double overheadTime = computeOverheadTime();
double actualSearchTime = searchTime - overheadTime;
// computeSearchTime()
long startTime = System.currentTimeMillis();
for (int i = 1; i <= NUMBER_OF_REPETITIONS; ++i)
doThisSearch(parameters);
long endTime = System.currentTimeMillis();
long elapsedTime = endTime - startTime;
return (double) elapsedTime / NUMBER_OF_REPETITIONS;
// computeOverheadTime()
long startTime = System.currentTimeMillis();
for (int i = 1; i <= NUMBER_OF_REPETITIONS; ++i)
doFakeSearch(parameters);
long endTime = System.currentTimeMillis();
long elapsedTime = endTime - startTime;
return (double) elapsedTime / NUMBER_OF_REPETITIONS;
If your timings do not agree with theory, there is something wrong with your algorithm or your timingsnot with the theory! For instance, if you graph the times of skip list searching for the various sizes of master file and see an n2 curve, you know something is wrong.
Youll need to turn in
A Microsoft Word document containing the report of your findings and conclusions, in the format and containing the information described above.
A copy of the Java programs that you used to create the results presented in your report, and their output, if such are not included directly in your report.
Your grade for this project is based on your programs correctness and style and on the reports quality, including how well its recommendations are supported, its composition and its presentation. And the higher weight is on the report: The programs are just an investigative tool; what matters most is the report conclusions and how well those conclusions are supported.
Marketing informs you that the firm from which the master file was obtained sends an update file containing from about 25 to 500 new phe group/ZIP items once every three months. It has the same format as the master file. This new information must be placed into the master file. Enhance your program so it has a function to read in the master file to the skip list or chain list, adds in the new items from the update file, and then writes out a new master file. Report on whether the adds are done more quickly when using a skip list or a chain list, and whether this difference changes your recommendation as to which structure should be used. Remember this update is done only every three months!
Similarly, marketing also get an update file ever three months that contains zip codes that have been retired; that is, they no longer exist. It is a text file, which each retired zip code on its own line. Enhance your program to process these deletions (by removing them from the skip list or chain list and then writing out a new master file), and again report on whether its faster to accomplish on a chain list or a skip list. Report whether this results changes your recommendation about which structure to use.
Also, marketing gets another update file ever three months that contains area code changes. It is a text file with each change on one line; first appears the old area code, then a blank, then the new area code. Enhance your program to update these area codes (by updating the entries in the skip list or chain list and writing out a new master file); as above, report on the relative speed of the operation on the two kinds of list and whether this changes your overall recommendation on which structure to use.