Hashing and Hash Tables

Introduction: Hash tables are a data structure for storing and retrieving unordered information, whose primary operations are in complexity class O(1): independent of the number of items stored in the hash table. We saw that digital trees had this same property, but only for special keys (keys that were digital, meaning we could decompose them into a first part of the key, a second part of the key, etc., as we can with digits in a number and characters in a String). Hash tables work with any kind of key. The most commonly used implementations of sets and maps (which are unordered) are hash tables. We will also implement a map via a hash table (including iterators) in Program #4.

Here are some terms that we need to become familiar with to understand (and talk about) hash tables: hash codes, compression function, bins/buckets, overflow-chaining, probing, load factor, and open-addressing. We will discuss each below.

Our Approach: We will start by discussing linear searching (using a linked list) of a collection of names. If we instead used an array of 26 indexes and put in index 0 a linked list of all names starting with "a", in index 1 a linked list of all names starting with "b", ... and in index 25 a linked list of all names starting with "z", we could search for a name about 26 times faster by looking just in the correct index for any name (according to its first letter). This speed increase assumes each letter is equally likely to start a last name, which is not a realistic assumption. Even if the assumption were true, the complexity class for searching is still O(N), just with a constant 1/26th of the original constant. Meaning, however long it takes to search among N names, it would take 1,000 times as long to search among 1,000N names: each linked list's length would grow by a factor of 1,000.

In fact, if we used an array of 26x26 (676) indexes, storing in index 0 a linked list of all names starting with "aa", in index 1 a linked list of all names starting with "ab", ... and in index 675 a linked list of all names starting with "zz", we could search for a name about 676 times faster by looking just in the correct index for any name (according to its first two letters). This speed increase assumes each letter pair is equally likely to start a last name, which is an even less realistic assumption. Even if the assumption were true, the complexity class for searching is still O(N), just with a constant 1/676th of the original constant.

Of course, this speedup isn't achieved unless we have many names and each bin is equally likely to hold a name (which isn't true: few names start with combinations like "bb", etc.). And what about looking up information that isn't a string: for example, we might want a set storing queues, or a map whose keys are priority queues. So while this approach seems promising, we need to modify it to be truly useful.

------------------------------------------------------------------------------
Hash Codes: Hashing is that modification. We declare an array with any number of "bins or buckets" and use a "hash code function" to compute an integer fingerprint for any piece of data that can go into the hash table, indicating in which bin/bucket it belongs. A hash code function must always compute the same hash code for the same value, so it cannot use random numbers.
We should design such a hash code function to generate the widest variety of numbers (over the range of all integers), with as small a probability as possible of two different values hashing to the same integer. Of course, in the case of using strings as values, there are more strings than integer values. For 32-bit ints there are only about 4 billion different values (actually, exactly 4,294,967,296), but there are infinitely many strings, which can be of any length: even if we consider only strings of lower-case letters, there are 26^N different strings with N characters; 26^7 is 8,031,810,176, so there are already more 7-letter strings than 32-bit ints.

Once we have a hash code function, we use a "compression function" to convert the hash code into a legal index in our hash table. One simple compression function computes the absolute value of the hash code (hash codes can be negative or positive, but array indexes are always non-negative) and then computes the remainder (using the % operator) with the hash table size/length as the 2nd operand, producing a number between 0 and length-1 of the hash table. Other compression functions use bitwise operations to compute a bit pattern in the correct range (doing so is faster than computing a remainder, but it typically works only for hash table sizes/lengths that are a perfect power of 2; of course, we can arrange for hash tables to start at length/size 1 and always double their size to meet this constraint).

Here is one example of a hash code function for strings.

  int hash(const std::string& s) {
    int hash_code = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
      hash_code = 31*hash_code + s[i]; //promotion: char -> int: its ASCII value
    return hash_code;
  }

Here, hash("a") returns 97 ('a' has an ASCII value of 97) and hash("aa") returns 3104 (31*97 + 97). Generally, if s.size() is n, then its hash value is given by the formula

  s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31^1 + s[n-1]

So, hash("ICS46") returns 69,494,459, and hash("Richard Pattis") returns -125,886,044! Yes, because of arithmetic overflow and the standard properties of binary numbers, the result might be negative (and overflow of negative numbers can go positive again). Recall that C++ does not throw any exceptions when arithmetic operators produce values outside the range of int: hashing is one of the few places where this behavior produces results that are still useful.

Generally, the hash code for each of the numeric types is an int value with the same bit pattern (just interpret those bits as an int). Characters hash to their ASCII values. We can build hash functions for other types in C++ from these: for example, strings are an ORDERED sequence of chars, and we can compute the hash code of a string by looking at every char in it. Note that generally hash("ab") != hash("ba"): when hashing strings, the order of the letters is important. In fact, a good hash code function for any ordered data type (say a queue or a stack) will produce different values for different orders of the same values.

In C++ we can use the following to compute hash codes. Here std::hash<std::string> is a class whose default constructor initializes str_hash, which overloads call (the () operator) to compute a hash value for a std::string.

  std::hash<std::string> str_hash;
  std::cout << str_hash("ICS46") << std::endl;

Interestingly, str_hash("a") returns 2,167,009,006. So it is using a hash code function different from the one shown above.
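To experiment, here is a self-contained sketch combining the function above with std::hash (the values std::hash produces are implementation-defined, so your output for str_hash may differ from the numbers quoted here; the negative results from hash rely on int overflow wrapping, as discussed above):

  #include <iostream>
  #include <string>
  #include <functional>  // std::hash

  // The 31-multiplier hash from above.
  int hash(const std::string& s) {
    int hash_code = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
      hash_code = 31*hash_code + s[i];  // promotion: char -> int
    return hash_code;
  }

  int main() {
    std::cout << hash("a")     << std::endl;  // 97
    std::cout << hash("aa")    << std::endl;  // 3104
    std::cout << hash("ICS46") << std::endl;  // 69494459

    std::hash<std::string> str_hash;          // the library's hash function
    std::cout << str_hash("ICS46") << std::endl;  // implementation-defined
  }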
I have not been able to find much on the web about how C++ actually computes the hash code for strings, but that information is irrelevant for using it.

Java uses a simpler approach: every class in Java can override the hashCode method, which is parameterless and returns the object's hash code. If we used this style in C++ (calling the method hash_code), here is an example of how we could compute the hash code for a queue: it uses an iterator to compute the hash code of every value in the queue. Computing the hash code for a sequence of values in a queue is similar to computing the hash code for a sequence of characters in a string, because the order is important. We can define this same hash_code function in every class implementing a Queue, because it is the same for every queue implementation.

  int hash_code() {
    int hash = 0;
    for (auto e : *this)
      hash = 31*hash + e.hash_code();
    return hash;
  }

It is critical that the operator== and hash_code methods for a class are compatible. The key property is: if a == b then a.hash_code() == b.hash_code(). Of course the opposite is NOT true, because many different strings have equal hash codes (there are more strings than 32-bit ints); of course, it is unlikely that many of the different strings actually used in some problem will have the same hash code (although it is likely that some will; we will look at empirical results to get a better understanding of the actual numbers).

This compatibility requirement is very important for UNORDERED collections like sets. Typically we iterate through the values of a set to compute its hash code, but the values in a set can be stored in (and iterated through in) any order. Regardless of the order in which these values are processed, they must produce the same hash code each time hash_code is called (because set implementations that happen to store their values in different orders are still ==). So, hash code functions for unordered collections (like sets) are slightly different: since the hash code must be the same no matter what order the values are stored in, we cannot use the hash_code method above, but instead must use something that accumulates the hash codes of the elements without regard to their order. Here we just add together (without the weighting of 31*) all the values.

  int hash_code() {
    int hash = 0;
    for (auto e : *this)
      hash += e.hash_code(); //or hash *= e.hash_code() (initializing hash to 1)
    return hash;
  }

Thus, if we add (or multiply) together all the hash codes, it doesn't make a difference in what order we do the addition (or multiplication): a+b+c, b+a+c, c+b+a, etc. all compute the same value (as do a*b*c, b*a*c, c*b*a, etc.).
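Here is a small self-contained sketch of this contrast, using std::hash on the elements; hash_ordered and hash_unordered are hypothetical names for the two accumulation styles just described:

  #include <iostream>
  #include <string>
  #include <vector>
  #include <functional>  // std::hash

  // Hash a sequence where order matters (like a string's chars or a queue).
  std::size_t hash_ordered(const std::vector<std::string>& values) {
    std::size_t h = 0;
    for (const auto& v : values)
      h = 31*h + std::hash<std::string>{}(v);  // weighting makes order matter
    return h;
  }

  // Hash a collection where order must NOT matter (like a set): just sum.
  std::size_t hash_unordered(const std::vector<std::string>& values) {
    std::size_t h = 0;
    for (const auto& v : values)
      h += std::hash<std::string>{}(v);        // + is commutative: any order works
    return h;
  }

  int main() {
    std::vector<std::string> ab = {"a", "b"}, ba = {"b", "a"};
    std::cout << (hash_ordered(ab)   == hash_ordered(ba))   << std::endl; // 0 (almost surely)
    std::cout << (hash_unordered(ab) == hash_unordered(ba)) << std::endl; // 1
  }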
Finally, here is something interesting that I came across when reading the Java String class. The real hashCode method in String looks like the following, where cachedHash is an instance variable (initially 0) in every String object. The first time hashCode is called, cachedHash is 0, so the method computes the hash value and stores it in cachedHash before returning it. Every other time it is called, it immediately returns cachedHash, doing no further computation. Note that in Java, Strings are immutable: once they are constructed their contents do not change, so once the hashCode is computed for a String object, that String object will always return the same result for its hashCode, and we don't have to recompute it. Here is how this method looks in Java.

  public int hashCode() {
    if (cachedHash != 0)
      return cachedHash;
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];  //promotion of char -> int
    return cachedHash = hash;
  }

If a String's computed hashCode is 0, it will be recomputed every time hashCode is called, even after it has been computed and cached (because with the != 0 test, Java cannot tell the difference between a hash code that has not been computed and a hash code that has been computed with value 0). Typically, the only String whose hash code is 0 is ""; most other Strings have a non-0 hashCode. Recomputing the hash code of "" is very quick, because it stores no values (chars.length is 0, so the loop immediately exits).

We could include an extra boolean instance variable named hashIsCached, initialized to false and set to true after caching. So we would have

  public int hashCode() {
    if (hashIsCached)
      return cachedHash;
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];  //promotion of char -> int
    hashIsCached = true;
    return cachedHash = hash;
  }

...but that extra instance variable is a bit of overkill. Of course, we could also compute the hashCode of every String WHEN IT IS CONSTRUCTED, storing it in cachedHash and always returning the cached value, never checking it. The upside is that the hashCode method would always just return this cached value; the downside is that we would have to compute the hash code for every String when it was created, even if we were never going to call hashCode on it. The approach above, actually used in Java, caches a hash code only if asked to compute it at least once.

------------------------------------------------------------------------------
Hash Tables with Overflow-Chaining

Hash tables are used frequently in practice, so there have been many studies, both theoretical and empirical, of hash code functions, which are at the heart of hash tables working efficiently. The best functions are quick to compute and return results scattered all over the range of int. Given such a hash code, the rest of the code to implement a hash table (see below) is straightforward.

Accompanying this lecture is a program, and an example of the results produced by running it, which empirically examines the hash code function for strings shown earlier in this lecture (and you can write your own hash code function to test too). The program itself was originally written in Java, and I converted it to C++. It allows students to posit various hash code functions and test them by generating random strings (which is an easy way to test a hash function, but may not accurately model the strings used in a real application), putting them in the hash table, and counting the collisions: when two different values hash/compress to the same bin. We will look at this data briefly.

Now let's look at how to insert a value into a hash table. We will assume that, as with set values and map keys, the values are distinct. Recall that the basic structure used in a hash table is an array of bins/buckets. Here, each bin refers to a linked list of values that hash/compress to that index. The basic picture looks like the following (here with 5 bins/buckets, 0-4, containing 6 values, v1-v6).
 Bin/Bucket     Collisions (handled through chaining, using a linked list)

   +---+
   |   |    +----+---+    +----+---+
 0 | ------>| v1 | --+--->| v2 | / |
   |   |    +--------+    +--------+
   +---+
   |   |
 1 | / |
   |   |
   +---+
   |   |    +----+---+
 2 | ------>| v3 | / |
   |   |    +--------+
   +---+
   |   |    +----+---+    +----+---+
 3 | ------>| v4 | --+--->| v5 | / |
   |   |    +--------+    +--------+
   +---+
   |   |    +----+---+
 4 | ------>| v6 | / |
   |   |    +--------+
   +---+

Generally, bins can have zero, one, or many values. We say that values v1 and v2 COLLIDED in bin 0, because these two different values both hashed/compressed to the same bin in the hash table. And we have used "overflow chaining" (a linked list) to keep track of all the "collisions"/"overflows". With good hash and compression functions, the values stored in a hash table should be approximately equally distributed throughout the bins. Of course, there will typically be some bins with fewer values (many may store none) and some bins with more, because hash codes aren't perfect. If a hash code always returned 0 (terrible!) then all values would collide and be stored in bin/bucket 0.

------------------------------------------------------------------------------
Hash Table Algorithms: 3 Important Algorithms Manipulating Hash Tables

1) insert (for Set) / put or setting with [] and = (for Map):
   Use hash_code/compression to compute a bin index (for the value/key).
   Search all the collisions to see if the information is there (sets have
   unique values/maps have unique keys).
   If it is there: for a Set, don't change anything; for a Map, change the
   value associated with the key in the ics::pair.
   If it is not there: add the information anywhere convenient in that bin:
   in a list node at the front, rear, wherever: there is no required ordering
   of information in the bins for sets and maps.

2) contains (for Set) / has_key or lookup with [] (for Map):
   Use hash_code/compression to compute a bin index.
   Search through the linked list (all the collisions) to see if the
   information is there.

3) erase (for Set/Map):
   Use hash_code/compression to compute a bin index.
   Search through the linked list (all the collisions) to see if the
   information is there, and if it is, remove it from the linked list.

As with the original array implementations of all the data structures, we could also decrease the array size when it becomes sparse. Recall the rule was: when the array was 1/4 filled, its size would be reduced by 1/2 (leaving the result half filled, or half empty).

------------------------------
2 More Important Algorithms

The "load factor" of a hash table is the number of values it contains divided by the number of bins in the hash table: the expected number of values that hash/compress to each bin. Generally, classes using hash tables try to keep the load factor below 1 (more bins than values, so each bin contains zero or very few values; hopefully just 1, but this is typically not achieved: see the empirical analysis). When the insert/put method is called for a Set/Map and a new value is to be put in the hash table, the load factor is checked; if adding the new value would make the hash table exceed its load factor threshold, we increase the size/length of the hash table to ensure that the load factor always stays below the specified threshold. Typically we double the size of the hash table when we must increase its size.
4) Double the Size/Length of the Hash Table:
   Remember the old hash table array and allocate a new/empty one 2 times as
   big. Traverse the old hash table (you can do it directly, or use the
   iterator if one is available, but that will be slower), adding each value
   to the new hash table, but NOT NECESSARILY IN THE SAME BIN! Instead, add it
   by applying the same hashing/compression again (the compression will be
   DIFFERENT, because the size of the new hash table is doubled: we may
   compute the same hash value, but we compute the remainder using the
   DIFFERENT TABLE SIZE). By being clever, we can re-use the entire LN (list
   node), so we don't have to allocate any new objects when doubling the hash
   table size; but this makes the code harder to write.

5) Iterator: uses a combination of index and cursor instance variables.
   (a) Constructor: loop to find the first bin with a list of values (not
       nullptr):
       Succeed: set index to the bin number, cursor to the first LN in the bin.
       Fail:    set cursor to nullptr.
   (b) ++: advance cursor = cursor->next; if it becomes nullptr, loop to
       advance the index to later bins, stopping at the first non-nullptr bin:
       Succeed: set index to the bin number, cursor to the first LN in the bin.
       Fail:    set cursor to nullptr.
   (c) Erase in the iterator: store the previous cursor (in an extra instance
       variable) to help do the removal, or store no extra information but use
       a trailer node in every bin (which makes removal easier; we will do
       this in Programming Assignment #4).

(A C++ sketch of algorithms 1-4 appears after the notes below.)

Note that iterators for data types implemented by hash tables return their values in a "strange" order (based on their hash values and collisions). In fact, adding one value to a hash table can cause it to exceed its load factor, and thus it will double the number of bins (doubling the size/length of the hash table), which causes rehashing. Afterward, the iterator order might be completely different. So, we should never assume any special ordering for these iterators, since data types might be implemented by hash table data structures. If we want to ensure that the values are processed in a specific order, we should put the values produced by the iterator into a PriorityQueue and iterate through it.

----------
Note that in the STL there are maps and unordered maps. Regular maps are typically implemented by BSTs (or other self-balancing trees), whereas unordered maps are typically implemented by hash tables. Generally, hash table operations are faster than tree operations: O(1) vs. O(log N). But if we must frequently iterate over the data in a specific key order, using regular maps might be better; of course, we can always iterate over an unordered map in an ordered way by putting all its values in a priority queue and then iterating over the priority queue in any order specified (by a function), not just the order of the keys in a BST.

---------
Implementing Overflows: Linked Lists and Beyond

Note that we can store in each bin an array of values, a linked list of values, a binary search tree of values, etc. (in fact, we can use anything that implements a Set data type). The reason that linked lists, and not more exotic data structures, are used is that with a load factor <= 1 we expect to find few values in each bin, so more exotic data structures just make the coding more difficult, with little gain in speed for searching a small number of values. But if we are going to use a load factor >> 1, using more complicated data structures to handle overflows might be a good idea. We can even use another hash table (with a different hashing function) to store the overflows.
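To make algorithms 1-4 concrete, here is a minimal, self-contained C++ sketch of a chained hash set of ints, assuming a load factor threshold of 1. The names (ChainedHashSet, compress, rehash) are hypothetical, and it uses std::forward_list for the bins rather than hand-written LN nodes; the course's real implementation is templated and supplies iterators.

  #include <iostream>
  #include <vector>
  #include <forward_list>
  #include <functional>  // std::hash

  class ChainedHashSet {
  public:
    bool contains(int value) const {               // algorithm 2
      const auto& bin = bins[compress(value)];
      for (int v : bin)
        if (v == value) return true;
      return false;
    }

    void insert(int value) {                       // algorithm 1
      if (contains(value)) return;                 // sets store unique values
      if (used + 1 > bins.size())                  // keep load factor <= 1
        rehash(2 * bins.size());                   // algorithm 4: double the bins
      bins[compress(value)].push_front(value);     // any position in the bin is fine
      ++used;
    }

    bool erase(int value) {                        // algorithm 3
      auto& bin = bins[compress(value)];
      for (auto prev = bin.before_begin(), cur = bin.begin();
           cur != bin.end(); prev = cur++)
        if (*cur == value) { bin.erase_after(prev); --used; return true; }
      return false;
    }

  private:
    // Start with 1 bin and always double: sizes stay a power of 2.
    std::vector<std::forward_list<int>> bins = std::vector<std::forward_list<int>>(1);
    std::size_t used = 0;

    std::size_t compress(int value) const {        // hash code -> legal bin index
      return std::hash<int>{}(value) % bins.size();
    }

    void rehash(std::size_t new_size) {            // values may land in DIFFERENT bins
      std::vector<std::forward_list<int>> old = std::move(bins);
      bins.assign(new_size, {});
      for (auto& bin : old)
        for (int v : bin)
          bins[std::hash<int>{}(v) % new_size].push_front(v);
    }
  };

  int main() {
    ChainedHashSet s;
    for (int v : {3, 14, 15, 9, 26, 5})
      s.insert(v);
    std::cout << s.contains(15) << " " << s.erase(15) << " "
              << s.contains(15) << std::endl;  // prints: 1 1 0
  }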
------------------------------------------------------------------------------
Why are hash tables O(1) with good hash functions:

In the worst case, a hash code function will produce exactly the same value for everything it hashes (not likely, but possible), so no matter how big the hash table is, we would go to the same bin to put/search for all N values there. Such a process would be O(N). But let's assume that we are using a hash code function that does a pretty good job (as most do: review the empirical data). For such a good hash code, if the table size/length is M, we would expect to search about N/M values in each bin. Thus, for any given M the method is O(N/M), but that is just O(N), because M is "a constant" and we remove constants from our big-O notation.

But we are doing something a bit more subtle. By keeping the load factor <= 1, for example, we ensure that M >= N (M is always at least N, and sometimes as much as 2*N, right after the load factor exceeds 1 and we double the size/length of the hash table). So M IS NOT A CONSTANT: it grows linearly with N. In fact, for a load factor <= 1, we know that M >= N; in the "worst case" M = N. That is, since M >= N,

  N/M <= N/N = 1

Therefore the complexity class O(N/M) is really O(N/N) or O(1). In fact, for ANY constant load factor the complexity class is O(1). For a load factor of 10, M is about N/10, so the complexity is O(N/(N/10)) = O(10), which is the same as O(1). Certainly with this bigger load factor we'd expect to spend about 10 times longer searching, but on average we'd still examine some fixed number of values in each bin, no matter how big N grows. So the magic here is that while O(N/M) for a constant M is O(N), when M is linearly related to N, growing as N grows such that N/M is <= some constant, O(N/M) is O(1).

------------------------------------------------------------------------------
Security via Hashing as a 1-Way Function: A Digression

Given a string, it is very easy to compute its hash code; but given a hash code, it is typically not easy to determine what string(s) hash to that value. In mathematics, such functions are called 1-way (or non-invertible) functions. There are many kinds of 1-way functions beyond hashing functions. We can use any 1-way function to provide security; let's look at one example, password security via hashing.

Have you ever wondered how a computer system stores your password? If the computer stored everyone's password in a file (as a list of user-ids and their passwords), then anyone who could steal/read that file could compromise all the accounts. Here is another way to store this information: instead, store a list of user-ids and the hash code of each password. When a user tries to log in, the system hashes the password they type in and sees whether it matches the hashed entry in the password table. Assume the hashing method is public (if it weren't, someone could steal it anyway), but it is a 1-way function: even knowing the algorithm wouldn't allow you to easily compute a password from its hash code (but would allow you to easily compute a hash code from a password). Now, if someone could read the password file, they would see only the hash codes of the passwords, not the passwords themselves, so they wouldn't know what passwords to use. Of course, they could write a program that generates all possible strings, hashes each, and looks for one that has the same hash code.
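Here is a minimal sketch of such a brute-force search, using the 31-multiplier string hash from earlier and trying only lower-case strings of a fixed length; brute_force and its odometer-style loop are purely illustrative.

  #include <iostream>
  #include <string>

  // The 31-multiplier string hash from earlier in this lecture.
  int hash(const std::string& s) {
    int hash_code = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
      hash_code = 31*hash_code + s[i];
    return hash_code;
  }

  // Try every lower-case string of the given length (odometer-style),
  // looking for ANY string whose hash matches the stolen one: it need
  // not be the original password.
  bool brute_force(int stolen_hash, int length, std::string& found) {
    std::string guess(length, 'a');
    for (;;) {
      if (hash(guess) == stolen_hash) { found = guess; return true; }
      int i = length - 1;                 // advance rightmost letter, with carries
      while (i >= 0 && guess[i] == 'z') guess[i--] = 'a';
      if (i < 0) return false;            // wrapped past "zz...z": all tried
      ++guess[i];
    }
  }

  int main() {
    std::string found;
    if (brute_force(hash("cat"), 3, found))  // pretend hash("cat") was stolen
      std::cout << "a matching password: " << found << std::endl;
  }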
Then they could log into the system with that user-id and that password (which would hash to the one stored in the password file, even if it wasn't the same password: getting the right hash code is all that is important).

That is why you are encouraged to have passwords that are long and have upper AND lower case letters (and maybe even symbols in them): it increases the size of the alphabet, which makes searching over all possible passwords for one that hashes to the right value much harder (although hackers have lots of information about the password structures people actually use, and can search more efficiently than by generating all possible passwords).

Assume that a computer could compute 10^9 (a billion) hash codes per second. There are about 1.4 x 10^17 different strings of length 10 (52^10, using upper- and lower-case letters). If we tried to generate all these strings, hash each, and compare it to the one we are looking for, it would take about 1.4 x 10^8 seconds, or about 4.5 years. This is another reason to change your password frequently (say, every 4.5 years!). Of course, such a search doesn't really need to find your password: it just needs to find a password that has the same hash code as your password. Often such hash codes are integers of about 60 bits, allowing for ~(2^10)^6 ~ 10^18 different hash values. So with an optimal hash code, every 10-letter string could have its own different hash code (not really an easy thing to arrange).

Actual password systems are more complicated these days and use advanced cryptographic methods. But these methods themselves are typically based on the general theory of 1-way/non-invertible functions: the functions are just much more interesting than the ones we have seen for computing the hash codes of strings.

------------------------------------------------------------------------------
The Complexity of Doubling Arrays:

We have seen that array data structures are used to store all kinds of data types: I have supplied array implementations of all the templated classes, and some advanced implementations that we will write, like HeapPriorityQueue and HashMap, are also based on arrays. In all these implementations, when adding values we must determine whether to double the size/length of the array. Most times we add a value, we don't double the size/length of the array: in all the code we've seen before, we just put the value in the next unused index in the array (or, in a hash table, chain it into a linked list in its bin). But as the size/length grows, eventually we double it, which means we must copy all the values currently in the array to a new array. So most additions are O(1), but some are O(N).

So, if we are talking about upper bounds, we might say that at worst each addition is O(N), and we do N adds, so the process of doubling and copying to get N values into an array is O(N^2). But we can derive a better (smaller) upper bound. We will use "amortized complexity" to analyze this case. At worst, we allocate a collection to have 1 array cell.

Adding the 1st value stores it in the array and requires no copying.
Adding a 2nd value doubles the size of the array, copying the 1st value (so 1 copy).
Adding a 3rd value doubles the size of the array, copying the 1st-2nd values (so 2 more copies).
Adding a 4th value stores it in the array and requires no copying.
Adding a 5th value doubles the size of the array, copying the 1st-4th values (so 4 more copies).
Adding the 6th-8th values stores them in the array and requires no copying.
Adding a 9th value doubles the size of the array, copying the 1st-8th values (so 8 more copies).
Adding the 10th-16th values stores them in the array and requires no copying.
Etc.

Each time we double the length, we copy twice as many values as before, but we can then add twice as many values before having to double again. If we end up with N values in the array, what is the total number of copies that we have to make? We will see below it is O(N): actually, bounded by 2N.

When we double the array size from 1 to 2, we have copied 1 value in total.
When we double the array size from 2 to 4, we copy 2 more values, for a total of 1+2=3 copies.
When we double the array size from 4 to 8, we copy 4 more values, for a total of 1+2+4=7 copies.

Notice that the sum 2^0 + 2^1 + 2^2 + ... + 2^N = 2^(N+1) - 1. We used this formula before to compute the maximum number of nodes in a binary tree of height h: 2^(h+1) - 1 (which has 1 node at depth 0, 2 nodes at depth 1, 4 nodes at depth 2, etc.). Here is a table. On the left, N is the number of values in the array, and on the right is the total number of values we need to copy.

     N     Movements/Copying
  ---------------------------------------------------------------
     1      0
     2      1
   3-4      3 = 1 (1->2) + 2 (2->4)
   5-8      7 = 1 (1->2) + 2 (2->4) + 4 (4->8)
   9-16    15 = 1 (1->2) + 2 (2->4) + 4 (4->8) + 8 (8->16)
  17-32    31 = 1 (1->2) + 2 (2->4) + 4 (4->8) + 8 (8->16) + 16 (16->32)
  33-64    63 = ...

Notice that for N a perfect power of 2, there are N-1 copies. When N is 1 bigger than a power of two, there are exactly 2*N-3 copies (the worst case relative to N). So the number of times that we must copy a piece of data as an array grows is linear: O(N).

Rather than thinking about "adding" as sometimes doing very little work and sometimes doing lots, we can think about "adding" as doing a bit of extra work (a constant amount) every time that we call it: really, most of the time we don't do the work, but every so often we have to do a lot of work, not just for the new value, but for all the ones before it. This is called amortized complexity. Still, the total work done for adding N values is just O(N): it cannot be less, because N*O(1) is O(N). You will study this more in ICS-161.

Another way to think of this is to ask how much extra work there is to double the number of values in an array. If the original length is N (say it is a power of 2), then when we add N more values, we will have to first copy each of the N values originally in the array into the new array, and then copy each of the new N values into the array. Thus, overall, adding N more values requires copying N values and then adding N values without copying, for a total of 2N operations. The total complexity is still just O(N), since every add required the actual add, and a total of N adds also required copying the N values originally in the array. Think about every add as counting for itself and one of the copying operations.
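Here is a minimal sketch that counts the copies directly, matching the table above (copies_to_add is a hypothetical name):

  #include <iostream>

  // Count how many copies are made while adding n values to an array that
  // starts with 1 cell and doubles whenever it is full.
  long long copies_to_add(long long n) {
    long long capacity = 1, size = 0, copies = 0;
    for (long long i = 0; i < n; ++i) {
      if (size == capacity) {      // full: double, copying every current value
        copies += size;
        capacity *= 2;
      }
      ++size;                      // the add itself (no copy)
    }
    return copies;
  }

  int main() {
    for (long long n : {1, 2, 4, 5, 16, 17, 1024, 1025})
      std::cout << "n = " << n << ": " << copies_to_add(n) << " copies\n";
    // n a power of 2 gives n-1 copies; n one past a power of 2 gives 2n-3.
  }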
------------------------------------------------------------------------------
Mutating a Value in a Hash Table (or BST):

If we are using a BST or hash table to store an object in a Set, or a key in a Map, we should not mutate that object inside the Set/Map. This is because the place it is stored depends on the value/key: when we put a value into a tree, we use comparisons to determine in which subtree(s) it belongs; when we put a value into a hash table, we compute its hash code to determine in which array index it belongs. So, if we mutate a key in a tree or hash table, it probably will no longer belong where it currently is: it would belong in a different location in the tree or a different bin in the hash table. That is, the code to locate and store a value in a Set/key in a Map is based on using its state (for comparison or hashing). If we store a value in a Set/key in a Map and then mutate it (change its state), we may never be able to locate that value/key again. So, it is a good idea to use immutable classes for keys, or at least to be extra careful not to mutate a value INSIDE a Set/key INSIDE a Map. Instead, we can remove it, change the key, and then add it back. For a hash table, both removal and addition are O(1) operations, so although awkward, changing a value in this way (remove it, change it, add it) requires only a constant amount of work to update it in the hash table.

------------------------------------------------------------------------------
Hashing with Open Addressing Instead of Chaining:

The final big topic on hashing is collision/overflow resolution without overflow chaining. By far the most useful way to handle different values that hash to the same bin is to store all these values in a linear linked list that the bin points to. Hash table operations do a linear search of such a list which, given a good hash code and a low load factor, will generally not be long. But this approach leaves some bins empty (about 1/3, from our empirical analysis), and linked lists require extra space that must be dynamically allocated and deallocated.

There is an alternative way to deal with collisions. While it takes up no extra space (compared to chaining, which constructs new LN objects), it can cause an increase in the time needed to search for a value in a hash table unless the load factor is kept low (< 70%). A load factor > 1 is not even possible, because we use a different bin for each value, so we must have at least as many bins as values. The method is called probing via open addressing. We will discuss 3 different forms of probing: linear, quadratic, and double hashing.

In linear probing, we compute the bin for storing a value; if the table already contains a value at that bin (we must be able to distinguish indexes with and without values: we can use a parallel array of information, or an array of pointers to objects and use nullptr to detect the absence of data), we increment the index by 1 circularly (incrementing the last array index brings us back to index 0) and keep probing bins until we find an empty bin, and then put the value there. So, the find method hashes to a bin and checks whether the value is there; if not, it continues from the original bin, linearly looking through other bins, until we (a) find the value we are looking for, or (b) reach an empty bin (meaning the value is not in the hash table: if it were there, we would have reached it before an empty bin). Note that many values can be examined along the way, including many with different hash codes that happen to place them a bit after the bin of the value we are looking for.

For example, let's use linear probing via open addressing for the following hash table.

     0     1     2     3     4     5     6     7     8     9
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  |     |     |     |     |     |     |     |     |     |     |
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Thus, if "a" hashes to bin 4, we put it there because it is empty.
     0     1     2     3     4     5     6     7     8     9
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  |     |     |     |     | "a" |     |     |     |     |     |
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Likewise, if "b" hashes to bin 5, we put it there because it is empty.

     0     1     2     3     4     5     6     7     8     9
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  |     |     |     |     | "a" | "b" |     |     |     |     |
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

But now, if "c" hashes to bin 4, we have to probe bins 4 and 5 until we find that bin 6 is the first empty one after 4, and put "c" there.

     0     1     2     3     4     5     6     7     8     9
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  |     |     |     |     | "a" | "b" | "c" |     |     |     |
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

If we are looking to see whether "d" is in the table (say it also hashes to bin 4), we start probing at bin 4, then check bin 5, bin 6, and finally bin 7: it is empty, so we know "d" is not in the hash table (if it were, we would have found it before reaching an empty bin). With chaining we would never have examined "b", because it would be in a different bin/bucket from "a", "c", and "d".

Another problem with probing via open addressing involves removing values. If we want to remove "b", we first find it by hashing to bin 5. But if we actually removed it (making bin 5 empty), then if we later looked for "c" the following problem would occur: we hash "c" (see above) and get bin 4, then we look at the next bin (which is now empty, because we removed "b"), and we would conclude that "c" is not in the hash table because we reached an empty bin first. But at the time "c" was added, there was a value in bin 5, which is why "c" had to go into bin 6 and not bin 5. So, in order to know we are done when we reach an empty bin, when we remove a value we find it and mark its bin as "available". Bins marked "available" had previously stored values and can store new values (if we reach them in the insertion algorithm), but unlike empty bins they cannot stop the probing: probing must continue until it reaches the value being located, or has passed through all occupied and "available" bins and reached a bin that has always been empty. We can use a parallel array that stores special values to represent bins that are "empty", "used", and "available" (= was "used", but not now).

Iteration over such a structure is easy: it is just a linear traversal of the array, skipping unoccupied (empty or available) positions. Likewise, we can easily create a hash table twice as big: iterate through the original hash table and put (by hashing with a new compression function) each value into the new hash table.

Instead of linear probing (where the bin number increases by 1 on every probe), we can do quadratic probing, where the bin number increases by ai+bi^2 (for i=0 the first time, i=1 the second time, etc.), once we specify the values for a and b: if both are 1/2, i.e., (i+i^2)/2, we probe hash, hash+1, hash+3, hash+6, hash+10, etc. Of course, the compression function handles indexes that get bigger than the table size/length. Likewise, in double hashing we use a second hash method h2 (probing hash, hash+h2(1), hash+h2(2), hash+h2(3), ...) to compute a value that is continually added to the bin index (with the compression function) until the value is located or an empty bin is found. Unlike linear probing, with quadratic probing or double hashing the probe sequence "after a bin" depends on how many probes it took to reach that bin in the first place. That is, in quadratic probing, if we hash directly to a bin the next probe is at hash+1; but if we reach that bin on the third probe (getting there as hash+6), the next probe is at hash+10. This typically improves performance by spreading out values in the hash table (avoiding clustering).
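Returning to linear probing with "available" marks, here is a minimal C++ sketch (ProbingTable and its names are hypothetical; a real implementation would track the load factor and resize before the table fills, and this simplified insert does not detect a duplicate sitting past an AVAILABLE bin):

  #include <iostream>
  #include <string>
  #include <vector>
  #include <functional>  // std::hash

  enum class State { EMPTY, USED, AVAILABLE };   // parallel-array bin states

  struct ProbingTable {
    std::vector<std::string> bins  = std::vector<std::string>(10);
    std::vector<State>       state = std::vector<State>(10, State::EMPTY);

    std::size_t compress(const std::string& s) const {
      return std::hash<std::string>{}(s) % bins.size();
    }

    void insert(const std::string& s) {
      std::size_t i = compress(s);
      while (state[i] == State::USED) {          // probe circularly for a free bin
        if (bins[i] == s) return;                // already present
        i = (i + 1) % bins.size();
      }
      bins[i] = s;  state[i] = State::USED;      // EMPTY or AVAILABLE bins are reusable
    }

    bool contains(const std::string& s) const {
      std::size_t i = compress(s);
      while (state[i] != State::EMPTY) {         // AVAILABLE bins cannot stop the probe
        if (state[i] == State::USED && bins[i] == s) return true;
        i = (i + 1) % bins.size();
      }
      return false;                              // reached an always-empty bin
    }

    void erase(const std::string& s) {
      std::size_t i = compress(s);
      while (state[i] != State::EMPTY) {
        if (state[i] == State::USED && bins[i] == s) {
          state[i] = State::AVAILABLE;           // mark, don't empty: keep probes alive
          return;
        }
        i = (i + 1) % bins.size();
      }
    }
  };

  int main() {
    ProbingTable t;
    t.insert("a"); t.insert("b"); t.insert("c");
    t.erase("b");
    std::cout << t.contains("c") << std::endl;   // 1: probing passes the AVAILABLE bin
  }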
Unless space is critical, it is typically better to use overflow chaining than any of the kinds of probing via open addressing discussed above. Often the extra overhead of the LN is small compared to its data, although chaining does require storage allocation/deallocation. Hash tables using probing via open addressing get clogged up with values in a non-linear way: as the load factor approaches 1, the searching time approaches O(N). So, if probing is used, we need to keep the load factor lower, say at .7. You can simulate such a hash table and measure the performance degradation by counting the average number of probes at various load factors.

If you look at the Programs link on the course web site (or the link for this lecture), you will see a download that allows you to test various hashing functions statistically. If you want, you can write your own hashing function and compare it to the one built into C++. There are two drivers there, one testing chaining and one testing open addressing. It would also be useful to test hash functions for the amount of time they take to compute their results.

------------------------------------------------------------------------------
Some Mathematics and Hashing: The Birthday Problem

Suppose we had a hash table of size N and a perfectly random hashing function. How many values would we have to add before the probability of a collision is 50%? The answer might surprise you, because it is low. In mathematics, this is related to "The Birthday Problem": how many people (k people) must be in a room before there is at least a 50% chance that two have the same birthday? Here we assume that each person is equally likely to be born on any day of the year (which is not true, but is close to true). The answer is nowhere near 365, or even 180: it is just 23.

Why are these problems similar? The birthday problem is like having a hash table of 365 dates (let's ignore February 29th) and hashing each person (with a good hash function) to the date they were born: a value between 1 and 365. We want to know the probability of a collision (two people hashing to the same date).

Think about the problem this way. We will compute the probability of k people having DIFFERENT birthdays: then 1 minus that number is the probability of at least two people sharing a birthday. For everyone to have a different birthday, the second person must have a birthday different from the first; the third person must have a birthday different from the first two; the fourth person must have a birthday different from the first three; ... We can compute the probability of k people having different birthdays exactly as

  365-0   365-1   365-2   365-3   365-4         365-(k-1)
  ----- x ----- x ----- x ----- x ----- x ... x ---------
   365     365     365     365     365             365

Let us generalize 365 to N, so we can compute the probability of k people having different birthdays given N=365 days in a year, or of k values having different bin numbers in a hash table of any size N. This product becomes

             N!
  P = ---------------
       (N-k)! x N**k

If we choose N=365 and compute this value for different values of k (say, in Excel), we get the following data. Note that the probability of two people having the same birthday (PS) is 1 - PD.
    k  |  PD (Probability of all Different birthdays)
  -----+---------------------------------------------
    1  |  100.00%
    2  |   99.73%
    3  |   99.18%
    4  |   98.36%
    5  |   97.29%
    6  |   95.95%
    7  |   94.38%
    8  |   92.57%
    9  |   90.54%
   10  |   88.31%
   11  |   85.89%
   12  |   83.30%
   13  |   80.56%
   14  |   77.69%
   15  |   74.71%
   16  |   71.64%
   17  |   68.50%
   18  |   65.31%
   19  |   62.09%
   20  |   58.86%
   21  |   55.63%
   22  |   52.43%
   23  |   49.27%   Answer: PS = 1 - 49.27% is now >= 50%
   24  |   46.17%
   25  |   43.13%
   26  |   40.18%
   ...

Thus, if we had a hash table with 365 bins, storing 23 or more values into it is likely (>= 50% of the time) to lead to at least one collision: 2 values being hashed to the same bin. Using the same methodology of constructing a table, if we had a hash table with 1,000 bins, hashing 38 or more values into it is likely to lead to at least one collision. Finally, for a hash table with 1,000,000 bins, hashing 1,178 or more values is likely to lead to at least one collision. We will prove that this number grows as O(sqrt(N)), and even compute the actual coefficient (which we will find is sqrt(ln(4)) = 1.17741...).

Stirling's approximation for N! is N**N x e**(-N) x sqrt(2piN). If we substitute it into the formula above, we get

              N**N x e**(-N) x sqrt(2piN)
  P ~ --------------------------------------------------
      (N-k)**(N-k) x e**(-(N-k)) x sqrt(2pi(N-k)) x N**k

  which simplifies to  P ~ e**(-k) x (1 - k/N)**(k-N-.5)

So, ln(P) = -k + (k-N-.5) ln(1-k/N).

Now, ln(1+x) = x - x**2/2 + x**3/3 - x**4/4 + ... (alternating signs), so

  ln(1-k/N) = -(k/N + k**2/(2N**2) + k**3/(3N**3) + ...)

so

  ln(P) = -k + (N-k+.5)[k/N + k**2/(2N**2) + k**3/(3N**3) + ...]   (note the sign flip)

Next, assume k << N, so when we multiply these two sums we drop any term like k/N or k/N**2, in which k's power is less than or equal to N's power, but we keep terms like k**2/N (where k's power is more than N's power). So

  ln(P) ~ -k + k - k**2/N + k**2/(2N)   (all other terms are dropped)
  ln(P) ~ -k**2/(2N)

and

  P ~ e**(-k**2/(2N))

If we wanted to determine for what k the probability of all-different values is P (so the probability of a collision is 1-P), we would solve

  k**2 ~ -2N ln(P)   so   k ~ sqrt(-2N ln(P))

If we want to know for what k the probability is about .5 (P=.5), we have k ~ sqrt(N) x sqrt(-2 ln(.5)), and sqrt(-2 ln(.5)) ~ 1.177410... So

  k ~ 1.177410 sqrt(N)

Note, for N = 365, 1,000, and 1,000,000 we get k ~ 22.5, k ~ 37.2, and k ~ 1,177, which all agree very well with the values computed in Excel. If we wanted to determine for what k the probability of all-unique hashes was 10% (the probability of a collision 90%), the formula is k ~ 2.145966 sqrt(N), about twice as big as for a 50% chance.
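Here is a minimal sketch that numerically reproduces these thresholds, computing PD incrementally as PD(k) = PD(k-1) x (N-(k-1))/N and comparing the results with the 1.17741 sqrt(N) approximation (threshold is a hypothetical name):

  #include <iostream>
  #include <cmath>

  // Smallest k for which the probability that k random values (uniform over
  // N bins) are all different drops below p.
  long long threshold(long long N, double p) {
    double pd = 1.0;
    for (long long k = 1; ; ++k) {
      pd *= double(N - (k - 1)) / N;   // k-th value must miss the k-1 used bins
      if (pd < p) return k;
    }
  }

  int main() {
    for (long long N : {365LL, 1000LL, 1000000LL})
      std::cout << "N = " << N
                << ": first k with PD < 50% is " << threshold(N, 0.5)
                << " (approximation 1.17741*sqrt(N) = "
                << 1.17741 * std::sqrt(double(N)) << ")\n";
    // Expected: 23 for N=365, 38 for N=1,000, 1178 for N=1,000,000.
  }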