CompSci 261, Winter 261, Homework 3

  1. Suppose that you are implementing a set data structure, which should represent a set of approximately $1000$ $16$-bit numbers. False positives are ok in your application, so one possible representation is to use a Bloom filter. Another possibility is to represent the set as an array of $2^{16}$ bits, indicating for each number whether it is in your set or not in the set, without any false positives. Estimate the range of false-positive rates for which the Bloom filter would be the more space-efficient of these two solutions.

  2. Recall that in one version of MinHash, the sketch for a set $S$ consists of the $k$ members of $S$ having the smallest values of a hash function $h$. Suppose that $S$ is so big that it will not fit into memory, and you have to compute the sketch by making a single pass over the list of set elements. Describe how to do this in linear time using memory $O(k)$. You may use the fact that the $k$ smallest elements of any larger set may be computed in time and space proportional to the size of the set.

  3. Recall that, in a single pass over a list of votes, using only a constant amount of storage, we can find a candidate $x$ such that, if any candidate has a majority of the votes, the candidate with the majority is $x$. However, for the solution described in class, we cannot determine whether the candidate we found actually has a majority.

    Suppose that we want a stronger solution, one in which we either return a majority candidate (when there is a majority) or report that there is no majority, again using a single pass over the data. Suppose there are $c$ different candidates that might appear in the data. Suppose also that $X$ and $Y$ are two sequences of $c/2$ votes, each of which has only one vote per candidate, and that the set of candidates in $X$ is different from the set of candidates in $Y$. Prove that any streaming algorithm that solves this stronger problem must have a different state after processing $X$ than it has after processing $Y$, by finding a third sequence of votes $Z$ such that the output after $XZ$ should be different than the output after $YZ$.

    (There are $\displaystyle \binom{c}{c/2}\approx \frac{2^c}{\sqrt c}$ different choices for the sets of candidates in $X$ and $Y$, and your answer to this problem will show that each of these choices must be represented differently by any streaming algorithm. It follows from this that the number of bits of memory used by the algorithm must be proportional to $c$, in contrast to the much smaller amount of memory used by the algorithm described in class.)