CompSci 261, Spring 2017, Homework 2

  1. For a hash table that uses hash chaining with $n$ keys, $N$ table cells, and load factor $\alpha=n/N$, prove that the probability that key $k$ belongs to a chain of size exactly $s$ is at most $\alpha^{s-1}/(s-1)!$.

    (Hint: How many ways are there for $s-1$ other elements to be placed into the same chain and what is the probability that any one of these ways happens?)

    Solution: The number of ways of selecting $s-1$ other keys that could share $k$'s chain is $$\binom{n-1}{s-1}<\frac{n^{s-1}}{(s-1)!},$$ and the probability that all $s-1$ keys in a particular selection hash to the same cell as $k$ is $$\frac{1}{N^{s-1}},$$ because each of them has an independent $1/N$ chance of landing in that cell. If $k$'s chain has size exactly $s$, then at least one of these collision events must occur, so by the union bound the probability is at most $$\frac{n^{s-1}}{(s-1)!} \cdot \frac{1}{N^{s-1}} = \frac{\alpha^{s-1}}{(s-1)!}.$$
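    As a sanity check (our own illustration, not part of the assigned solution), the bound can be compared against a small Monte Carlo simulation of hash chaining; the function name `chain_size_prob` and the parameter values are arbitrary choices:

    ```python
    import random
    from math import factorial

    def chain_size_prob(n, N, s, trials=2000):
        """Empirically estimate the probability that a fixed key's chain
        has size exactly s, under uniform random hashing into N cells."""
        hits = 0
        for _ in range(trials):
            cells = [random.randrange(N) for _ in range(n)]
            # the chain containing key 0 consists of all keys in its cell
            if cells.count(cells[0]) == s:
                hits += 1
        return hits / trials

    random.seed(0)                     # make the estimate reproducible
    n, N, s = 100, 200, 3              # load factor alpha = 1/2
    bound = (n / N) ** (s - 1) / factorial(s - 1)
    est = chain_size_prob(n, N, s)
    print(est, "<=", bound)            # estimate should fall below the bound
    ```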
  2. The simplest way that a cuckoo hash table could fail to insert all keys would be for three of its keys to have the same two hash table cells as their hash values.

    1. For a cuckoo hash table with $n$ keys, two tables of $N/2$ table cells each, and load factor $\alpha=n/N$, what is the expected number of triples of keys for which this happens? You can calculate this expected number as the total number of triples of keys, multiplied by the probability that any given triple has this problem.

      Solution: The first key's two hash values fix one cell in each table; each of the other two keys must hash to both of those same cells, which happens with probability $1/(N/2)^2$ per key. So the expected number of bad triples is $$\binom{n}{3}\frac{1}{(N/2)^4}.$$
    2. Use Markov's inequality (and the observation that when the number of bad triples is less than one, it can only be zero) to prove that the probability of getting this kind of failure is $O(1/n)$ when $\alpha$ is a constant less than $1/2$.

      Solution: By the assumption that $\alpha$ is constant, the expected number of bad triples simplifies to $O(n^3/n^4)=O(1/n)$. Let $X$ be the number of bad triples; since $X$ is a non-negative integer, a failure of this kind requires $X\ge 1$. By Markov's inequality, $\Pr[X\ge 1]\le E[X]=O(1/n)$.
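    To see the $O(1/n)$ scaling numerically (our own illustration, with an arbitrarily chosen load factor $\alpha=1/4$, i.e. $N=4n$), the expectation times $n$ approaches the constant $1/96$:

    ```python
    from math import comb

    def expected_bad_triples(n, N):
        """Expected number of triples of keys whose two hash values
        (one cell in each of the two tables of N/2 cells) all coincide."""
        return comb(n, 3) / (N / 2) ** 4

    # with alpha = n/N fixed at 1/4, n * E[X] tends to a constant,
    # confirming E[X] = Theta(1/n)
    for n in (100, 1000, 10000):
        print(n, n * expected_bad_triples(n, 4 * n))
    ```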
  3. Give pseudocode (or, if you prefer, Python code) for an algorithm that uses a dictionary to process a list of items and returns as output a list of the items that appear only once in the input. For example, when your algorithm is run with the list [2,7,1,8,2,8] as its argument, it should return the list [1,7] or [7,1] (the output ordering is unimportant). Your algorithm should not sort the input (because sorting is too slow) and when the input length is $n$ it should use $O(n)$ dictionary operations. The dictionary operations you can use include creating an empty dictionary, testing whether a key is present in the dictionary, setting the value associated with a key (or adding the key if it is not already present), looking up the value associated with a key, and listing all key-value pairs.

    Solution:

    ```python
    counts = {}
    for x in input:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return [x for x, c in counts.items() if c == 1]
    ```
    Or, more cleanly, using two sets in place of a single dictionary:

    ```python
    once = set()
    twice = set()
    for x in input:
        if x in once:
            twice.add(x)
        else:
            once.add(x)
    return [x for x in once if x not in twice]
    ```
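    The counting approach above can be packaged as a runnable function and checked against the example from the problem statement (the name `unique_items` is our own choice):

    ```python
    def unique_items(items):
        # count occurrences of each item with a dictionary,
        # using O(n) dictionary operations in total
        counts = {}
        for x in items:
            counts[x] = counts.get(x, 0) + 1
        # keep the items that were seen exactly once
        return [x for x, c in counts.items() if c == 1]

    print(sorted(unique_items([2, 7, 1, 8, 2, 8])))  # prints [1, 7]
    ```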
  4. Suppose that you are choosing a hashing algorithm to use as part of a web server, and you want to prevent denial-of-service attacks in which an attacker, knowing which algorithm you use and which keys are in your hash table, sends requests that try to make your server use as much processing time as possible. You may assume that these requests can only cause your server to perform hash table lookups, but not to change the contents of the hash table. Which of linear probing, cuckoo hashing, or hash chaining would be the most secure against this type of attack? Explain your answer.

    Solution: Cuckoo hashing, because the fact that its queries take constant worst-case time means that an attacker cannot slow them down by finding and then repeatedly querying keys with many collisions.
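    A minimal sketch (our own illustration, with toy hash functions chosen only for the example) of why cuckoo-hash lookups take worst-case constant time: a lookup probes exactly two cells, one per table, no matter what the attacker queries:

    ```python
    def cuckoo_lookup(key, table1, table2, h1, h2):
        """Look up key with at most two probes: one cell in each table."""
        if table1[h1(key)] == key:
            return True
        return table2[h2(key)] == key

    # toy example: two tables of 8 cells and trivial hash functions
    h1 = lambda k: k % 8
    h2 = lambda k: (k // 8) % 8
    t1, t2 = [None] * 8, [None] * 8
    t1[h1(42)] = 42                    # store 42 in its first-table cell
    print(cuckoo_lookup(42, t1, t2, h1, h2))   # found after one probe
    print(cuckoo_lookup(7, t1, t2, h1, h2))    # absent after two probes
    ```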