CompSci 261, Winter 2018, Homework 2

In the deletion algorithm for linear probing, we need to know whether an index $i$ appears between $h(k)$ and $j$, where $h(k)$ is the starting index for key $k$ and all three of $i$, $h(k)$, and $j$ belong to the same block of occupied cells. Give pseudocode or a formula for performing this betweenness test, that correctly handles the case when the block of cells wraps around from the end of the hash table back to its start.

Solution: if $j \ge h(k)$ return $i \ge h(k)$ and $i \le j$, else return $i \ge h(k)$ or $i \le j$.
Recall that, for a linear probing hash table with load factor at most $1/2$, the multiplicative Chernoff bound tells us that the probability of seeing a block of exactly $B$ filled cells in one particular location in the hash array is at most $(e/4)^{B/2}$. Use this formula to give a bound on the expected number of blocks of filled cells of this length throughout the entire hash array. What is the largest value of $B$ for which your bound on the expected number is at least $1$? (Hint: How many different blocks of length $B$ are there?)

Solution: Let $n$ be the number of keys, so (with the given load factor) there are $2n$ cells in the hash table, and $2n$ places where a block of $B$ cells could start. By linearity of expectation, the expected number of filled cells of this length is at most $2n(e/4)^{B/2}$. To make this equal to 1, we need $B$ to be proportional to $\log n$; more precisely, $B=2\log_{4/e}(2n)$.
The version of cuckoo hashing that we described in class uses two hash functions $h_1(k)$ and $h_2(k)$ that both map a key $k$ to a position within a single array $A$ of key/value pairs. Sometimes cuckoo hashing is instead implemented with two different arrays $A_1$ and $A_2$, so that $h_1(k)$ maps key $k$ to a position in array $A_1$ and $h_2(k)$ maps key $k$ to a position in array $A_2$. Recall that one way for cuckoo hashing to fail is for three keys to all map to the same two array cells.

Compute the probability that a given triple of keys all map to the same two array cells, for both kinds of cuckoo hash table, when there are $n$ keys and a total of $2n$ array cells. (For the version of cuckoo hashing with two tables, you can assume that each of the two arrays has $n$ cells.) For which of these two versions of cuckoo hashing is there a greater probability of this type of failure, or are they both equal in this respect?

Solution: Let's call the three keys $x$, $y$, and $z$. With the single table of $2n$ cells, there is a $1/(2n)$ probability that $h_1(x)=h_2(x)$, and when this happens the probability that the four other hash values also land in the same place is $1/(2n)^4$. On the other hand, there is a $(2n-1)/(2n)$ probability that $h_1(x)\ne h_2(x)$, and when this happens each other key has a $2/(2n)^2$ probability of landing in the same two places (possibly with the two hash functions swapped, but this still counts as a collision). So the total probability of a three-way collision is $$\frac{1}{32n^5}+\frac{2n-1}{8n^5}=\frac{2n+3}{8n^5}\approx \frac{1}{4n^4}.$$ With two tables of $n$ cells, collisions only happen when $h_1(x)=h_1(y)=h_1(z)$ and $h_2(x)=h_2(y)=h_2(z)$; there is no possibility of the two functions being swapped. Each of $h_i(y)$ and $h_i(z)$ (for $i=1$ or $2$) has probability $1/n$ of equalling $h_i(x)$. So the calculation is much simpler, and the probability of a three-way collision is $$\frac{1}{n^4}.$$ So triple collisions are much less likely (by nearly a factor of four) in the single-table solution as they are in the two-table solution.
Recall that the load factor of a hash table is defined as $n/N$, where $n$ is the number of key-value pairs stored in the table and $N$ is the number of cells in the hash array. If a single reference to an object can be stored in a single machine word, then a single key-value pair takes two machine words to store, so the minimum amount of memory required for any structure that lists all pairs would be $2n$, while an open-addressing hash table would use $2N$ machine words. Thus, the load factor also equals $2n/2N$, the ratio between the minimum amount of memory needed and the amount of memory actually used.

Now, consider a hash chaining data structure, where we store the key-value pairs within any cell of the hash table as a linked list (a collection of objects that each have three references or pointers, for the key, value, and next object in the list) and where each table cell stores a single reference or pointer to the first object in its list. If this structure uses $M$ memory cells, define the "effective load factor" of this structure to be $2n/M$, where $2n$ is again the minimum storage required for $n$ key-value pairs and $M$ is the storage actually used.

Write a formula that translates the load factor $f$ of a hash chaining data structure to the effective load factor of the same structure. What is the value of $f$ for which the effective load factor is $1/2$?

Solution: If there are $n$ keys, and the load factor is $f$, then there are $n/f$ cells in the array of the hash table (a single pointer each), and $n$ linked list objects (with three pointers each), for a total memory usage of $n(1/f+3)$ pointers. Therefore, the effective load factor is $$\frac{2}{1/f+3}.$$ To get this effective load factor to be $1/2$, we need $f$ to be $1$. What this means is that, in order to use the same amount of memory as an open-addressing hash table with load factor $1/2$, a hash chaining data structure would have to use the higher load factor $1$. This is relevant if we want to compare the performance of different hashing algorithms with the same memory usage as each other. It wouldn't be fair to compare open addressing and hash chaining at the same (actual) load factors, because that comparison would give much more memory to the chaining algorithm than to the open addressing algorithm.