CompSci 261, Winter 2018, Homework 3

Suppose we are doing tabulation hashing of four-character strings, using four tables of random numbers indexed by single characters. Given a set of keys, define a "uniquely occurring character" to be a character that is in one of the strings at a certain position, but is not in any of the other strings in the same position.
1. Prove that, in every set of three four-character strings, at least two of the strings have a uniquely occurring character.
  
  Solution: If there is a position at which the three strings all have different characters, then all three strings have a uniquely occurring character at that position. Otherwise, at each position where the strings differ, two of them share a character and the third string has a uniquely occurring character. So if fewer than two of the three strings had any uniquely occurring characters, some two of the strings would have to be equal at all of their positions, contradicting the assumption that we have a set of three strings. (Recall that, in any set, the elements are all distinct from each other.)
2. Find a set of four four-character strings none of which has a uniquely occurring character.
  
  Solution: "aaaa", "aabb", "aaba", and "aaab".
(This is the main insight needed to prove that tabulation hashing is 3-independent but not 4-independent.)
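Both parts of this problem can be checked mechanically. The following is a minimal sketch, not part of the original solution; the helper names `has_unique_char` and `count_with_unique_char` are made up for illustration. It assumes that a three-letter alphabet is enough for the exhaustive check of part 1, because three strings can show at most three different characters at any one position.

```python
from itertools import combinations, product


def has_unique_char(key, keys):
    """Return True if `key` has, at some position, a character that
    no other string in `keys` has at that same position."""
    others = [k for k in keys if k != key]
    return any(all(k[i] != key[i] for k in others) for i in range(len(key)))


def count_with_unique_char(keys):
    """Count the strings in `keys` that have a uniquely occurring character."""
    return sum(has_unique_char(k, keys) for k in keys)


# Part 2: the four strings from the solution above.
four = ["aaaa", "aabb", "aaba", "aaab"]
part2_count = count_with_unique_char(four)

# Part 1: exhaustive check over every set of three distinct
# four-character strings on the (assumed sufficient) alphabet {a, b, c}.
strings = ["".join(p) for p in product("abc", repeat=4)]
min_over_triples = min(
    count_with_unique_char(triple) for triple in combinations(strings, 3)
)
```

Here `part2_count` counts the strings of the part 2 example that have a uniquely occurring character, and `min_over_triples` is the smallest such count over all triples tested.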
Suppose we have a cuckoo filter with $N$ cells and $b$ bits per fingerprint. In it, each key $x$ is represented by a $b$-bit fingerprint $f(x)$, which may be stored in either of two locations $h_1(x)$ and $h_1(x)\oplus h_2(f(x))$ (where "$\oplus$" represents the bitwise binary exclusive or operation, written as "^" in most programming languages). Define two keys $x$ and $y$ to be indistinguishable if they have the same fingerprint as each other and their fingerprints would be stored in the same two locations. Given $x$ and $y$, and assuming that $f$, $h_1$, and $h_2$ are all random functions, what is the probability that $x$ and $y$ are indistinguishable?

Solution: Several students asked whether we need to consider the possibility that the two locations can be the same (i.e., can $h_2$ ever be zero). I answered that we do, but if you answered the other way you should still get full credit. So let's go through both answers.

First, suppose that we allow $h_2$ to be zero. Then there are two scenarios that would cause $x$ and $y$ to be indistinguishable: (A) $h_1(x)=h_1(y)$ and $f(x)=f(y)$, or (B) $f(x)=f(y)$, $h_2(f(x))\ne 0$, and $h_1(x)=h_1(y)\oplus h_2(f(x))$. These scenarios are mutually exclusive (the values of $h_1$ agree in (A) and differ in (B)), so we can add their probabilities. The probability of scenario (A) is \[ \frac{1}{N}\cdot\frac{1}{2^b}, \] because there is a $1/N$ probability of matching values of $h_1$, a $1/2^b$ probability of matching values of $f$, and the two events are independent. The probability of scenario (B) is \[ \frac{1}{N}\cdot\frac{1}{2^b}\cdot\frac{N-1}{N}, \] by a similar calculation where we also have to include the probability $(N-1)/N$ that $h_2$ is nonzero. Putting it all together, the total probability is \[ \frac{1}{N}\cdot\frac{1}{2^b}\cdot\frac{2N-1}{N} \approx \frac{1}{N2^{b-1}}. \]

Next, let's repeat the same calculation but with a version of the cuckoo filter that always chooses $h_2$ to be nonzero. In this case, scenario (A) is the same as before. Scenario (B) has the same description as before, but a different calculation of its probability: we no longer need the $(N-1)/N$ factor because $h_2$ is always nonzero. So both scenarios have equal probabilities, and the total probability is exactly \[ \frac{1}{N2^{b-1}}. \]
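Both formulas can be confirmed for small parameters by exact enumeration. The sketch below is an illustration only, not part of the original solution. The function name and its parameters are hypothetical, and it assumes that $N$ is a power of two so that the exclusive or of two cell indices is again a valid cell index. Because indistinguishable keys must share a fingerprint, it enumerates a single value `g` standing for $h_2(f(x))$.

```python
from fractions import Fraction
from itertools import product


def indistinguishable_probability(n_cells, bits, allow_zero=True):
    """Exact probability that two keys are indistinguishable, found by
    enumerating every equally likely outcome of h1(x), h1(y), f(x), f(y),
    and g = h2(f(x)).  Assumes n_cells is a power of two."""
    cells = range(n_cells)
    fingerprints = range(2 ** bits)
    offsets = range(n_cells) if allow_zero else range(1, n_cells)
    hits = total = 0
    for h1x, h1y, fx, fy, g in product(cells, cells, fingerprints,
                                       fingerprints, offsets):
        total += 1
        # Same fingerprint, and the same pair of candidate locations.
        # When fx == fy, both keys use the same offset g.
        if fx == fy and {h1x, h1x ^ g} == {h1y, h1y ^ g}:
            hits += 1
    return Fraction(hits, total)


n_cells, bits = 8, 2
with_zero = indistinguishable_probability(n_cells, bits, allow_zero=True)
without_zero = indistinguishable_probability(n_cells, bits, allow_zero=False)
```

For these small values, `with_zero` can be compared against $(2N-1)/(N^2 2^b)$ and `without_zero` against $1/(N2^{b-1})$.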
One way to make a 2-independent hash function is to choose a large prime number $p$ (larger than the possible range of key values), choose two random coefficients $a$ and $b$ modulo $p$, and define the hash function to be the function $h(x) = ((ax+b) \bmod p) \bmod N$. Suppose we try shortcutting this step, and compute a simpler function $f(x) =(ax+b) \bmod N$. Would $f$ be a good choice of hash function? Explain why or why not.

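Solution: No, because $f(x)$ depends only on the value of $x$ mod $N$. Any two keys that are equal mod $N$ collide for every choice of $a$ and $b$, so if all of the keys are equal mod $N$ they will all be hashed to the same place.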
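A small experiment makes the failure concrete. This is a sketch with arbitrarily chosen illustrative values ($N=10$, $p=101$, and four keys that are all equal to 3 mod $N$), none of which come from the original problem. It checks every coefficient pair for $f$, and a single arbitrarily chosen pair for $h$.

```python
def f(x, a, b, n):
    """The shortcut function (ax + b) mod N."""
    return (a * x + b) % n


def h(x, a, b, p, n):
    """The 2-independent function ((ax + b) mod p) mod N."""
    return ((a * x + b) % p) % n


n, p = 10, 101             # illustrative table size and prime
keys = [3, 13, 23, 33]     # all equal to 3 mod n

# For every choice of coefficients, collect the set of cells used by f.
f_images = {
    frozenset(f(x, a, b, n) for x in keys)
    for a in range(p)
    for b in range(p)
}
f_always_collides = all(len(image) == 1 for image in f_images)

# For one arbitrary choice of coefficients, the cells used by h.
h_image = {h(x, 7, 3, p, n) for x in keys}
```

With these values, `f_always_collides` records whether $f$ sends all four keys to one cell no matter which coefficients are chosen, while `h_image` shows the cells that $h$ uses for the same keys.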
Suppose that you are inserting a sequence of keys, one at a time, into a Bloom filter with 1000 cells, and that each key is mapped to four of these cells. Before adding each key $x$, you use the Bloom filter to test whether $x$ is already a member of the set, and if it says that $x$ is a member you stop the process (without inserting $x$ again). What is the maximum possible number of keys that you can insert before you stop?

Solution: 997. The first insertion causes four cells to change from false to true, and each successive insertion causes at least one cell to change from false to true (because otherwise the filter would report that the key to be inserted is already present, and the process would stop). After the first insertion 996 cells are still false, so at most 996 more insertions can happen before all cells become true, for a total of at most 997. Once all cells are true, no more keys can be inserted. The bound is attained when every insertion after the first changes exactly one cell.
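The bound can be seen in action with a simulation. The sketch below is an illustration under stated assumptions, not a real Bloom filter implementation: `adversarial_cells` is a hypothetical worst-case assignment of cells to keys, constructed so that each key after the first reuses three cells that are already set and touches exactly one new cell.

```python
def adversarial_cells(i, m=1000):
    """Hypothetical worst-case mapping for the i-th key (counting from 0):
    cells 0, 1, 2 are shared by every key, and the fourth cell cycles
    through the remaining m - 3 cells."""
    return (0, 1, 2, 3 + i % (m - 3))


def insert_until_reported_present(m=1000, cells_for=adversarial_cells):
    """Insert keys 0, 1, 2, ... into an m-cell Bloom filter, stopping as
    soon as the filter reports the next key as already present.
    Returns the number of keys inserted."""
    bits = [False] * m
    inserted = 0
    while True:
        cells = cells_for(inserted, m)
        if all(bits[c] for c in cells):
            return inserted          # membership test says "present": stop
        for c in cells:
            bits[c] = True
        inserted += 1


max_inserted = insert_until_reported_present(1000)
```

The loop always terminates, because every successful insertion sets at least one cell that was previously false and there are only `m` cells.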