Program 4

Implementing Maps and Sets
(and their iterators) with Hash Tables

ICS-46: Data Strcuture Implementation and Analysis

This programming assignment is designed to ensure that you know how to implement two templated classes (Map and Set) with Hash Tables. Your implementations will also include fully-functional iterators for these classes. For the map, you will be writing code that implements open-addresssing (using linear-probing), processing two dynamic arrays: the bin non-contiguous array stores indexes into the entry array (or special values indicating the bin indexs nothing); the h_entry contiguous array stores key->value pairs, along with the hashed value of the key (trading increased space for increased efficiency). For the set you will be writing simlar code that processes similar arrays.

You can test these implementations by using the standard drivers (provided with the download) and GoogleTests that we will use when grading your code for correctness; recall that you can augment the GoogleTest with whatever code you want, to aid your debugging: a GoogleTest is just a C++ program. You can also test the code you wrote for Programming Assignment #1 (which used array implementations of these classes) by substituting these hash table implementations -typically by changing a few typedef statements and adding the hash function needed for their constructors. In fact, you will be required to translate my solution to the wordgenerator program to use HashOpenMap and HashOpenSet and indicate the performance improvement (over versions using slower classes/implementations) on a large text file (wghuck.txt).

Write and use the standard insertion (<<) operator and str() method in the map and set classes for debugging. For each return a std::string that explicitly shows the contents of the parallel arrays (the status of each index, and the association for any occupied index), along with the other instance variables, so you can easily examine the structure of the hash table. Note that there is no tested requirement for what these methods return, but the versions above will make debugging easier. Here is an example of what my str() method returns for a HashOpenMap with 4 bins and 2 key->value pairs: here the result of putting the associations a->1, then b->2, then c->3, d->4, then e->5 into into the map, and then erasing a->1 from the map:

map m = HashOpenMap[
  bin[0] = empty
  bin[1] = empty
  bin[2] = empty
  bin[3] = 1; with value -1812609155 : b -> 2
  bin[4] = 0; with value -1145311956 : e -> 5
  bin[5] = was occupied
  bin[6] = 2; with value 611974285 : c -> 3
  bin[7] = 3; with value 1146565749 : d -> 4
--------------------
  h_entry[0] = -1145311956 : e -> 5
  h_entry[1] = -1812609155 : b -> 2
  h_entry[2] = 611974285 : c -> 3
  h_entry[3] = 1146565749 : d -> 4
  h_entry[4] = unused
  h_entry[5] = unused
  h_entry[6] = unused
  h_entry[7] = unused
](load_threshold=1,bins=8,entries=8,used=4,mod_count=6)

Note that the bin array is not contiguous (there are no values in the beginning indexes and a hole in the middle); the h_entry array is continguous (it values start at index 0, and all unused indexes are at the end: there are no holes). Finally, I have used a load factor of in the example above, so these two arrays have the same length (8), but with a smaller load factor, the bin array can be larger than the h_entry array.

You should download the program4 project folder and use it to create an CLion project. You will write the required methods in the hash_map.hpp and hash_set.hpp files in this project, and submit each separately in Checkmate. The project folder also contains two pairs of .hpp and .cpp files: a driver/GoogleTest pair for each class that you will write, and the driver.cpp file which has a main function that can be made to run any of the two drivers.

Important: Only one of the .cpp files with a main method can be active/compiling at any time. In the download, only the driver.cpp file is active; the GoogleTests are inactive. To make a progam inactive, select it (in the editor tab), use the Ctrl-a command to select all its lines, and then type Ctrl-/ (or click Source at the top left of the menu and choose Toggle Comment): every line will now appear in a comment (so the main function is commented-out); by using these same instructions, you can toggle back those lines to not have comments.

I recommend that you work on this assignment in pairs. Try to find someone who lives near you, with similar programming skills, and work habits/schedule: e.g., talk about whether you prefer to work mornings, nights, or weekends; what kind of commitment you will make to submit program early.

Only ONE STUDENT should submit the assignment (all parts of it). If students work in pairs, BOTH NAMES and their UCInetID names must appear in a comment at the top of each submitted program. For example if Romeo Montague (whose UCInetID is romeo1) submitted a program that he worked on with his partner Juliet Capulet (whose UCInetID is jcapulet) the comment at the top of each submitted file would appear as:

  // Submitter: romeo1(Montague, Romeo)
  // Partner  : jcapulet(Capulet, Juliet)
  // We certify that we worked cooperatively on this programming
  //   assignment, according to the rules for pair programming

If you do not know what the terms cooperatively and/or rules for pair programming mean, please read about Pair Programming before starting this assignment. If the names do not appear at the top of all your submissions in exactly this form, points will be deducted. If you are submitting by yourself, you may omit all lines but the first (Submitter). Please do turn in each program as you finish it, so that I can accurately assess the progress of the class as a whole during this assignment; do not turn in all the programs at the same time.

Print this document and carefully read it, marking any parts that contain important detailed information that you find (for review before you turn in the files).

This assignment has 3 parts: pairs should work on each part together, not split them up and do them separately. This is especially important in this assignment, because once you have a working HashOpenMap, you can reuse many parts of its code in HashOpenSet: so working on each part together will encourage maximum progress. Part 1 is worth 40 points; part 2 is worth 15 points; part 3 is worth 5 points. Remember I'm going to be running MOSS on the parts of this assignment.

IMPORTANT: The courselib contains an array implementation for the map and set data types (and you have written a BST version of map; and a trailer linked-list version of set); although this assignment requires you to use hash tables (using open-addressing), there are still many strong similarities at a high level in all these implementations (even for iterators). So, I encourage you to examine these implementations closely, and understand them; possibly, experiment with them (using their drivers or GoogleTests), while you are writing your hash table implementation: this advice is especially true as you begin to study, understand, and implement iterators. Please feel free about asking questions about these methods: both their syntax and semantics.

Probaby the best way to design this assignment is to write a general HashTable class, and then use it to implement both HashOpenMap and HashOpenSet. But, because of the complexity of using templated-classes in C++, the shortness of time, and the fact that hash tables are used slightly differently to implement maps and sets, I have "simplified" the assignment by requiring you to implement hash tables in a map first, and then reimplement them in a set; so, you can reuse lots of code by cutting/pasting/updating).

Maps (40 pts)

Maps can be implemented by a variety of data structures: they all associate unique keys with values (possibly duplicated), by storing an ics::pair holding both the key and its value. The data structure should allow us quickly to find the value associated with a key, and possibly change/update such a value.

We can implement maps efficiently by using a hash table using a simple hashing function on only the key (first) part of the pair. Most operations updatinga hash table are O(1). We will a variant of open addressing (discussed in the notes) for this hash table; this variant is very simlar to what Python currently uses to store its dict data type. Of cousre, we must supply HashOpenMap with a hash (hashing) function.

This hash can be passed to any constructor or instantiated in the template. When writing the copy constructor and operator=, try to use the fact that "the hash table arrays being copied already have good information in them" to simplify these operations (and avoiding rehashing the keys if you can): don't just put each key->value pair into an empty map (if the hashing functon is the same and the array lengths are the same). For the operator=, also remember to copy the hash function from the rhs map.

Finally, throw an IcsError exception with an appropriate message, if the load_threshold specified for any constructor is greater than 1.; unlike chaining, open-addressing requires at least as many bins/h_entrys as values in the hash table.

For the copy constructor: if the hash function for the object being constructed is the same as the function for the map to copy, we can copy the values in the bin/h_entry arrays without rehashing anything; but if the hash function is different, then the best we can do in the copy constructor is to put all the key/value pairs into the new map (with reasonably-sized arrays).

The file hash_map.hpp declares the appropriate constructors, methods, operators, helper methods, and especially instance variables (reproduced below in a tiny font below). The comments here are critical to your understanding of the bin and h_entry arrays, their relationship, and the other instance variables that are used with them.

const static int bin_empty  = -1;   //Special values that can be stored in the bin array,
const static int was_occupied = -2; //  along with non-negative indexes into the h_entry array
int*         bin     = nullptr;     //Non-contiguous int array: bin[i] is either -1, -2, or an index into the h_entry array
HashedEntry* h_entry = nullptr;     //Contiguous HashedEntry array storing hash(key) and key/value as pair of pair
                                    //bin/h_entry are allocated in constructors and in ensure_lengths
double load_threshold;              //used/bins must always be <= load_threshold (see ensure_lengths)
int bins             = 1;           //Allocated length of bin array (>= 1, so hash_compress doesn't % 0)
int entries          = 1;           //Allocated length of h_entry array (can be < bins, if load_factor < 1)
int used             = 0;           //# of key/value associations in this map; used <= entries (see ensure_lengths)
                                    //  bins-used = # of bin_empty/was_occupied values in bin array
int mod_count        = 0;           //For sensing concurrent modification

Note the many helper methods that operate on hash tables: most are easily implemented iteratively, although recursive implementations may be simple for some too. Many of the standard map methods call one or more of these helper methods to retrieve information from a hash table and/or change it (e.g., add/remove key->value assocations). These include:

void hash_compress (const KEY& key, int& hashed, int& bin_index) const; This method uses the hash function supplied by the constructor and the number of bins in the current hash table, to compute and return (via two reference parameters) both the hashed value and the the bin index of any given key. Remember to compute the absolute value of the hash function (which can return a negative result) and use the remainder operator (%) to ensure a bin index in the range [0,bins).
int find_key (const KEY& key) const; This method attempts to find the key in some bin index of a hash table: if successful it returns the index (possibly to examine or update its associated value); if unsuccessful it returns -1. The caller of this function will determine what to do with the value returned.
Recall that when using open-addressing with linear probing, you might have to check a sequence of bin array indexes, correctly processing those storing negative numbers (bin_empty and was_occupied). Also, treat the bin array circularly, if the probing runs beyond the last bin array index.
For efficiency, when probing the h_entry values indexed by the bin array looking for the matching key, first check to see if the hashed value of the key we are looking for matches the stored hash value in the h_entry array: only if this matches use the == operator to check key equality: the comparison of the two hashed values (ints) is likely to be faster than the == operator (often a std::string comparison).
operator =: Copy the load_threshold from the right side to the left, along with all the bin/h_entry arrays and related information.
int next_unoccupied (int bin_index) const; This method finds the first bin index that is unoccupied (whose status is either bin_empty or was_occupied) starting at bin_index and using linear probing; see the circularity advice for the previous method.
void ensure_lengths (int new_used); This method ensures that the bin and h_entry arrays in a hash table are long enough to store new_used values: this means both that (a) the load factor threshold is not exceeded (it is lowered by lengthing the bin array) and (b) that the h_entry array is long enough.
(a) If the new load factor (compute this ratio with doubles) would exceed the load factor threshold, create a new bin array and fill it with re-hashed indexes to all the h_entry array values (iterate through them, leaving the h_entry array unchnaged). Note that we re-hash because although the hashed value of a key stays the same, the bins increase, so the % operator might compute a different bin index when the hashed value is compressed.
(b) If we need to lengthen the contiguous h_entry array, its values all stay in the same indexes: there are now just more empty indexes at the end of this array.
We typically call ensure_lengths before adding a key->value to the map (and do not call it when changing the value associate with a key); remember that calling this method can change how hash_compress computes values in the future, because the number of bins might change.

I have defined all these methods in hash_map.hpp. I suggest examining code in the array_map.hpp or bst_map.hpp files to help you understand some of the bookkeeping requirements of these methods. Pay close attention to ensure all instance variables receive the correct values in the constructors and are used/set correctly in queries and commands.

Iterators: We return to simpl-ish/efficient iterators for hash tables. This h_entry array makes the iterators fairly easy to write and very efficient to use: construction/methods/operations are all O(1) and require O(1) extra space. Examine the types

typedef ics::pair<KEY,T>      Entry;         //Stores key and its associated value
typedef ics::pair<int,Entry>  HashedEntry;   //Stores hash(key) and Entry (key/value); see the h_entry member

Note that map iterators produce Entrys, which are defined by typedef ics::pair<KEY,T> Entry;.; the h_entry arrays stores HashedEntry, also using ics::pair. So, with the h_entry array you will use .first to process the hash value, .second.first to process the key, and .second.second to process the value associated with that key.

Feel free to question/discuss any issues for HashOpenMap on the appropriate Piazza folder, so long as no code is posted or described in too much detail.

Sets (15 pts)

Not much to say here. Take all that you learned about the hash table implementation of map and apply it to the implementation of set. I reused (copy/pasted/updated) all the helper methods discussed above. For the actual set constructors and methods, I wrote similar code. The main difference between maps and sets is that "values" in maps are key->value pairs and "values" in sets are just single values.

wordgenerator (5 pts)

The download contains the file wordgenerator.cpp: my solution to the last part of Programming Assignment #1 (which also times itself running for building the corpus and creating a priority queue of its values in order). It also contains wghuck.txt: the text, appropriately formated, of Mark Twain's book, "The Adventures of Huckleberry Finn". For this part of the assignment, you must alter the #includes and any typedefs and code (like writing hashing functions) in this program, so that it uses the improved data type implementations: HashOpenMap and HashOpenSet) instead of the "high-complexity" (low efficiency) simple array versions of these data types. In the comment at the bottom of this program, fill in the running times produced by the program (see the output line starting read_corpus time = and print_corpus (sort not print) time =) using one version with HashOpenMap and the standard ArraySet, and then with the HashOpenMap and HashOpenSet (if you finish it). Furthermore, briefly explain after writing these times what is interesting about them (and why it occurs). Use an order statistic of 2.

Testing

There are various ways to test each of the classes you are writing in this programming assignment. First, though, you should write all the methods, paying careful attention to the array implementations and previously written linked list implementations. For some, you might just boiler-plate simple code that is not correct, but allows the methods to compile, allowing other methods in the classes to be tested. For example (from the HashOpenMap class) template<class KEY,class T, int (*thash)(const KEY& a)> bool HashOpenMap<KEY,T,thash>::empty() const {return false; //On this line to indicate boiler-plate } template<class KEY,class T, int (*thash)(const KEY& a)> T HashOpenMap<KEY,T,thash>::put(const KEY& key, const T& value) {return value; //On this line to indicate boiler-plate }
Also, a constructor or void method will compile correctly with no statements in its body. Finally, don't forget to write actual code for the code you've boiler-plated.

The easiest way to start testing//debugging is by using the driver program. It allows you to perform any method call supported by the templated classes, and see the state of the class (or view the debugger). Start with with the default constructor and the insertion (<<) operator and str() method, so you can use them to debug the other methods; then test the put command; then the queries (empty, size, has_key, and has_value); then the remaining commands clear and erase; then the operators ([], =, ==, and !=); then the iterators; then finally the remaining constructors.

After you debug your code with the driver, try running the appropriate GoogleTest code. Again, this form of testing is useful only as you approach a finished solution. We will use the GoogleTest, and visual inspection, to grade this assignment. Important Note: You can put std::cout statements in the GoogleTest code (but don't accidentally remove any of the assertions, otherwise you won't be fully checking your code the way we will) while you are debugging your classes. All debugging std::cout should end in std::endl to flush the output stream: ensure the output it displayed before executing the next statement

When you run the GoogleTest, choose small values for the first and third prompts (just press return to the second prompt) or comment-out these prompts and assign small values to these variables instead. Besides an indication of which tests pass and fail, the console window will show a speed for the speed test (which will vary depending on how fast a machine you run your code on): don't worry about it. When your code is passing all the tests, put in values like 10,000 for these prompts.

Program 4

Implementing Maps and Sets (and their iterators) with Hash Tables

ICS-46: Data Strcuture Implementation and Analysis

Implementing Maps and Sets
(and their iterators) with Hash Tables