# CompSci 261, Spring 2017, Homework 3

1. Suppose that you are implementing depth-first search, and need a set data structure to keep track of the set of vertices that your search has already visited. Your graph has $n$ vertices, each of which has an id number in the range from $0$ to $k$ (for some $k>n$).

1. If you implemented the set as a bitvector, with bit positions indexed by the id numbers of the vertices, how many bits of memory would you need, as a function of $k$ and $n$? Do not use $O$-notation.

Solution: $k+1$
2. If you implemented the set as a linear-probing hash table, with each table cell stored as a 32-bit word and with load factor $\alpha=1/2$, how many bits of memory would you need? (Just count the number of bits in the table itself, not any other structures such as the hash function.)

Solution: $64n$ (assuming a hash table with 32-bit keys only, not key-value pairs)
3. How large would $k$ have to be (as a function of $n$) in order to make the bitvector use more space than the hash table?

Solution: $64n$. (However, this solution is only valid for $n\le 2^{26}$, as otherwise the hash table could not be implemented to use only 32 bits per table cell as the problem states.)
2. This question concerns tabulation hashing, with the following parameters: the hash function $h(x)$ maps 32-bit keys to 16-bit hash values by breaking each key into four 8-bit bytes, using each byte as the index into four tables of 16-bit random numbers, and returning the bitwise exclusive or of the four numbers found in this table.

1. List four different hexadecimal numbers $a$, $b$, $c$, and $d$ such that the bitwise exclusive or of their hash values obeys the equation $h(a)\oplus h(b)\oplus h(c)\oplus h(d)=0$, no matter what numbers are used to fill the tables of random numbers. Write each number as an eight-digit value in hexadecimal; for instance the hexadecimal number 00ff00ff has all bits equal to zero in its first and third bytes, and all bits equal to one in its second and fourth bytes.

Solution: 00000000, 000000ff, 00ff0000, 00ff00ff (or any other four values chosen by fixing constant values for two of the bytes and combining two different values for each of the other two bytes, in all four possible ways).
2. If $h$ were a random function (rather than being computed by tabulation hashing) what would be the probability that the same equation would be true for your four numbers?

Solution: $1/2^{16}$. The exclusive or of four uniformly-random 16-bit values is another uniformly-random 16-bit value, which has probability only $1/2^{16}$ of being zero.
3. For a Bloom filter that stores $n$ elements, mapping each of them to exactly three positions (chosen uniformly at random, with replacement) in a table of $b$ bits, write an exact formula for the expected number of nonzero bits in the table. What is the numerical value of your formula when $n=1000$ and $b=6000$?

(Hint: First determine a formula for the probability that a single cell is zero, and subtract it from one to get the probability that a single cell is nonzero. The expected number of nonzeros is $b$ times the probability that a single cell is nonzero.)

Solution: The probability that any one cell is zero is $(1-1/b)^{3n}$. Therefore, the expected number of nonzero cells is the number of cells times one minus this probability: $$b\left(1-\left(1-\frac{1}{b}\right)^{3n}\right).$$ Plugging in $n=1000$ and $b=6000$ gives an expected number of approximately $2361$ nonzero bits. The homework didn't ask for the next calculation, but with this many nonzeros, the false positive rate would be $$\left(\frac{2361}{6000}\right)^3 \approx 6.1\%.$$
4. Suppose we are using hash chaining for a collection of $2^{16}$ keys, each of which is a 32-bit integer, with a table whose length is also $2^{16}$ (so the load factor is $\alpha=1$). In class we saw that this will work, with constant expected time per operation, as long as the hash function is $2$-independent, and that a $2$-independent hash function $h$ can be constructed by finding a large prime $p$ (larger than $2^{32}$), choosing two random numbers $a$ and $b$ mod $p$, and returning $h(x)=((ax+b)\bmod p)\bmod 2^{16}$ as the hash value. Suppose we try to simplify this by leaving out the large prime and instead returning $h(x)=(ax+b)\mod 2^{16}$ as the hash value, where $a$ and $b$ are again chosen randomly. Describe a set of keys that will cause this simplified hash function to fail (take significantly more than constant time per operation).

Solution: Choose keys that are all multiples of $2^{16}$, so they all hash to cell $b$.