1. In class we described van Emde Boas trees under the simplifying assumption that each item has a key different from every other item's key. But what if this assumption is not true? Describe how to modify the data structure to handle the case that more than one item may have the same priority. Your solution should have the same asymptotic running time per operation as the van Emde Boas tree described in class. (Possible solutions include storing linked lists of items with the same priority, or using a separate data structure to map priorities to the set of items having those priorities. You may assume that hash tables can solve the dictionary problem in constant time per operation. You should describe how your modifications affect the operation of the three basic priority queue operations: insert, delete, and find-min.)

SOLUTION: Make the value associated with a key in the van Emde Boas tree be a collection of all the items that really have that key. To do an insertion, search the tree to find whether the inserted key is already there; if so, add the new item to the existing collection, and otherwise start a new collection. To do a deletion, remove the item from its collection, and if that causes the collection to become empty then also perform the deletion in the van Emde Boas tree itself. You will need either to know, when you do a deletion, which insertion it corresponds to (in which case you can remember the position in the collection, e.g. as a node in a linked list, allowing you to go directly to that position when it comes time to delete it), or the item being deleted (in which case you can maintain the collection of items as another hash table). Find-min is unchanged except that it returns an item from the collection associated with the minimum key.
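A minimal sketch of this bookkeeping in Python, under some assumptions: the class and method names (MultiKeyPQ, SetBackedPQ, insert, delete, find_min) are illustrative rather than part of the problem, and SetBackedPQ is only a slow stand-in so the example runs on its own; a real van Emde Boas tree exposing the same three methods would slot in unchanged.

from collections import defaultdict

class SetBackedPQ:
    """Stand-in for a van Emde Boas tree: same insert/delete/find_min
    interface over distinct integer keys, but without the fast vEB bounds.
    It exists only so that the sketch below runs on its own."""
    def __init__(self):
        self.keys = set()
    def insert(self, key):
        self.keys.add(key)
    def delete(self, key):
        self.keys.discard(key)
    def find_min(self):
        return min(self.keys) if self.keys else None

class MultiKeyPQ:
    """Priority queue allowing repeated priorities, layered on a structure
    that stores each distinct priority once (e.g. a van Emde Boas tree).
    A hash table maps each priority to the collection of items currently
    stored with that priority; the tree holds only the distinct priorities."""
    def __init__(self, tree=None):
        self.tree = tree if tree is not None else SetBackedPQ()
        self.items = defaultdict(list)  # priority -> items with that priority

    def insert(self, priority, item):
        bucket = self.items[priority]
        if not bucket:                  # first item with this priority,
            self.tree.insert(priority)  # so the priority enters the tree
        bucket.append(item)

    def delete(self, priority, item):
        bucket = self.items[priority]
        bucket.remove(item)             # O(bucket size) here; see note below
        if not bucket:                  # last item with this priority is gone,
            del self.items[priority]
            self.tree.delete(priority)  # so the priority leaves the tree

    def find_min(self):
        p = self.tree.find_min()        # smallest priority currently present
        return None if p is None else (p, self.items[p][0])

# Example: two items share priority 5; deleting one of them keeps 5 in the tree.
pq = MultiKeyPQ()
pq.insert(5, "a"); pq.insert(5, "b"); pq.insert(9, "c")
print(pq.find_min())   # (5, 'a')
pq.delete(5, "a")
print(pq.find_min())   # (5, 'b')
pq.delete(5, "b")
print(pq.find_min())   # (9, 'c')

Storing each priority's items in a Python list makes delete take time proportional to the bucket size; as the solution notes, remembering a linked-list node per item, or keeping the bucket as another hash table, would make that step constant time.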
2. Slot-size bound for hash chaining (Cormen, Leiserson, Rivest, and Stein, Introduction to Algorithms, Problem 11-2, page 250): Suppose we have a hash table with n slots, with collisions resolved by chaining -- that is, each slot stores a linked list of the (key, value) pairs that the hash function maps to that slot. Suppose also that n keys are inserted into the table (that is, the fill ratio is 1). Each key is equally likely to be hashed to each slot, independently of the other keys. Let M be the maximum number of keys in any slot after all the keys have been inserted. Your mission is to prove an O(log n / log log n) upper bound on E[M], the expected value of M.

(a) Argue that the probability Q_k that exactly k keys hash to a particular slot is given by

Q_k = (1/n)^k (1 - 1/n)^(n-k) (n choose k).

(The last term is the binomial coefficient, which counts the number of distinct k-element subsets that can be formed from n items; it equals n!/(k!(n-k)!).)

SOLUTION: There are (n choose k) sets of k keys that could map to that slot. Each key in such a set has probability 1/n of actually being mapped to that slot, and each of the remaining n-k keys has probability 1 - 1/n of being mapped elsewhere. So for any particular set S of k keys, there is probability (1/n)^k (1 - 1/n)^(n-k) that the set of keys mapped to that slot is exactly S. Different choices of S give us disjoint events (it is not possible for two different sets of k keys to both be the exact set of keys mapped to a given slot), so we can add these probabilities over all possible choices of S, giving the stated formula.

(b) Let P_k be the probability that M = k; that is, the probability that the slot containing the most keys contains exactly k keys. Show that P_k <= n Q_k.

SOLUTION: P_k = Pr[at least one slot contains exactly k keys and no slot contains more than k keys] <= Pr[at least one slot contains exactly k keys] <= n Q_k. The final inequality is a union bound rather than an equality because the events are not disjoint: it is possible for two or more different slots to each contain exactly k keys.

(c) Use Stirling's approximation to show that Q_k < e^k / k^k. Stirling's approximation is

n! = sqrt(2 pi n) (n/e)^n (1 + O(1/n)),

where e ~ 2.71828 is the base of the natural logarithm.

SOLUTION: We can use Stirling's approximation to expand

(n choose k) = n!/(k!(n-k)!)
             ~ [sqrt(2 pi n) / (sqrt(2 pi k) sqrt(2 pi (n-k)))] [n^n / (k^k (n-k)^(n-k))] (1 + O(1/n))
             < n^n / (k^k (n-k)^(n-k)).

Here we have dropped the square-root factor (it always multiplies out to something significantly less than one) and at the same time used it to absorb the final (1 + O(1/n)) factor; the powers of e in the numerator and denominator cancel exactly. So, bringing in the other factors of Q_k from part (a), we are left with

Q_k < [n^n / (k^k (n-k)^(n-k))] (1/n)^k (1 - 1/n)^(n-k).

Dropping the factor (1 - 1/n)^(n-k), which is at most one, splitting n^n into n^(n-k) n^k, and regrouping the factors with exponent k and the factors with exponent n-k, this simplifies to

Q_k < (1 + k/(n-k))^(n-k) (1/k^k),

since n^(n-k) / (n-k)^(n-k) = (n/(n-k))^(n-k) = (1 + k/(n-k))^(n-k) and n^k (1/n)^k = 1. Finally, we need the inequality that for any x > 0 and y >= 1, (1 + x/y)^y <= e^x; substituting x = k and y = n - k gives Q_k < e^k / k^k, as desired.

(d) Show that there exists a constant c > 1 such that Q_k0 < 1/n^3 for k0 = c log n / log log n. Conclude that P_k < 1/n^2 for k >= k0.

SOLUTION: Taking logarithms in the inequality from (c) gives ln Q_k < k ln(e/k) = -k (ln k - 1). Plug in k0 = c ln n / ln ln n (using natural logarithms here only changes the constant, which is absorbed into c). Then ln k0 = ln c + ln ln n - ln ln ln n, so for c >= e (and our eventual choice is c = 6) we have ln k0 - 1 >= ln ln n - ln ln ln n, and therefore

ln Q_k0 < -c (ln n / ln ln n) (ln ln n - ln ln ln n).

For sufficiently large n, ln ln n >= 2 ln ln ln n, so this bound is at most -(c/2) ln n. Letting c = 6 gives ln Q_k0 < -3 ln n, or equivalently Q_k0 < 1/n^3. From part (b), P_k0 <= n Q_k0 < 1/n^2, and the same holds for larger values of k, since the bound (e/k)^k only decreases as k grows beyond k0.

(e) By arguing that E[M] <= n Pr[M > c log n / log log n] + (c log n / log log n) Pr[M <= c log n / log log n], conclude that E[M] = O(log n / log log n).

SOLUTION: Clearly, no matter what random choices we make, M is at most n. Writing k0 = c log n / log log n as in part (d),

E[M] = sum_i i Pr[M = i]                                          [definition of expected value]
     = (sum_{i < k0} i Pr[M = i]) + (sum_{i >= k0} i Pr[M = i])   [splitting the sum into two ranges]
    <= (sum_{i < k0} k0 Pr[M = i]) + (sum_{i >= k0} n Pr[M = i])  [replacing i by its maximum value within each range]
     = k0 (sum_{i < k0} Pr[M = i]) + n (sum_{i >= k0} Pr[M = i])  [bringing factors that do not depend on i out of the sums]
    <= k0 + n (sum_{i >= k0} Pr[M = i])                           [the first sum of probabilities is at most one]
    <= k0 + n (sum_{i >= k0} 1/n^2)                               [Pr[M = i] = P_i < 1/n^2 for i >= k0, by part (d)]
    <= k0 + 1                                                     [the second sum has at most n terms, each 1/n^2, so it is at most 1/n]
     = O(log n / log log n).
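As an optional sanity check on the bound (not part of the problem), the sketch below estimates E[M] empirically by throwing n keys into n slots uniformly at random and averaging the maximum slot size over several trials, printing it next to ln n / ln ln n. The function names, table sizes, and trial counts are arbitrary choices, and no constant factors are tracked, so only the growth rates are comparable.

import math
import random

def max_load(n, rng):
    """Throw n keys into n slots uniformly at random (ideal hash chaining
    at fill ratio 1) and return M, the number of keys in the fullest slot."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return max(counts)

def estimate_expected_max_load(n, trials=100, seed=0):
    """Average M over independent trials as a Monte Carlo estimate of E[M]."""
    rng = random.Random(seed)
    return sum(max_load(n, rng) for _ in range(trials)) / trials

# E[M] should grow like log n / log log n; the printed reference value
# ln n / ln ln n ignores the constant c from the analysis.
for n in (1 << 8, 1 << 11, 1 << 14):
    estimate = estimate_expected_max_load(n)
    reference = math.log(n) / math.log(math.log(n))
    print(f"n = {n:6d}   E[M] ~ {estimate:5.2f}   ln n / ln ln n = {reference:5.2f}")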
3. In a cuckoo hashing data structure with two hash functions g(x) and h(x), the hash table may still operate successfully if two elements x and y are inserted with g(x) = g(y) and h(x) = h(y), but it will fail if it is given three elements x, y, and z with g(x) = g(y) = g(z) and h(x) = h(y) = h(z).

(a) If a cuckoo hashing data structure has n slots in each table, and the hash functions are uniformly random, what is the probability that some particular triple (x, y, z) collides in this way?

SOLUTION: g(y) is equally likely to land in any of the n cells; in particular, it has probability only 1/n of landing in the same cell as g(x), so Pr[g(x) = g(y)] = 1/n. Conditioned on that event, g(z) has probability only 1/n of landing in the same cell as both of them, so the probability that all three g-values are equal is 1/n^2. The same reasoning applies to the h-values, and since the two hash functions are chosen independently, the total probability is (1/n^2)(1/n^2) = 1/n^4.

(b) If n elements are inserted into the structure, what is the expected number of triples that collide in this way?

SOLUTION: There are n(n-1)(n-2)/6 possible triples, and by part (a) and linearity of expectation each contributes 1/n^4 to the expected value, so the total expected number is n(n-1)(n-2)/(6 n^4) < 1/(6n).

(c) Use Markov's inequality (http://en.wikipedia.org/wiki/Markov%27s_inequality) together with your answer to part (b) to get a bound on the probability that there is at least one triple collision.

SOLUTION: Markov's inequality states that for any nonnegative random variable R and any threshold t > 0, Pr[R >= t] <= E[R]/t. Here R is the number of triple collisions, its expected value is bounded in part (b), and t = 1. So Pr[there is at least one triple collision] = Pr[R >= 1] <= E[R] < 1/(6n).
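For illustration (again, not required by the problem), the following sketch draws independent uniformly random values g(x) and h(x) in {0, ..., n-1} for each of n elements, counts triples that agree on both hash values, and compares the empirical probability of at least one triple collision with the 1/(6n) Markov bound. The names count_triple_collisions and collision_probability, the table sizes, and the trial counts are all arbitrary choices.

import random
from collections import Counter
from math import comb

def count_triple_collisions(n, rng):
    """Give each of n elements uniformly random hash values (g(x), h(x)),
    each in {0, ..., n-1}, and count the triples that agree on both values:
    a group of m elements with identical (g, h) contributes comb(m, 3)."""
    pairs = Counter((rng.randrange(n), rng.randrange(n)) for _ in range(n))
    return sum(comb(m, 3) for m in pairs.values())

def collision_probability(n, trials=20000, seed=0):
    """Empirical probability that at least one triple collision occurs."""
    rng = random.Random(seed)
    hits = sum(count_triple_collisions(n, rng) > 0 for _ in range(trials))
    return hits / trials

# Part (c) bounds Pr[at least one triple collision] by less than 1/(6n);
# up to sampling noise, the measured frequency should come in below that.
for n in (16, 32, 64):
    observed = collision_probability(n)
    print(f"n = {n:3d}   observed = {observed:.4f}   bound 1/(6n) = {1/(6*n):.4f}")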