1. We say that an ordinary Bloom filter, representing a given set S of items, operates correctly if, for every x, the result of querying x meets the following conditions:
- If x belongs to S, then the result of the query must be true
- If x does not belong to S, then the result of the query may or may not be true.
Describe a data structure that always uses O(1) space (regardless of the size of S) and that meets these two conditions. Would your data structure be useful in place of a Bloom filter? Why or why not?

SOLUTION: Don't store anything and always answer true. Obviously, this is not useful (except possibly as a stub to use for testing how some larger piece of software behaves when given false positives) because it gives no actual information about the data, but it meets the requirements.


2. Suppose we choose a hash function h(x), where x is an integer from 1 to n, as follows: we find a large randomly-chosen prime number P, and define h(x) = (x mod P) mod N. Is this function 1-independent? Why or why not?

SOLUTION: No, because x=1 is not equally likely to be mapped to any value mod N; instead, it is always mapped to 1.


3. Suppose we are n different sets S_i, with total size N = sum |S_i|, and that we wish to estimate the Jaccard index for every pair of sets, with an average estimation error of epsilon. Write down, using O-notation as a function of n, N, and epsilon, the time that it would take to perform this estimation using the variant of the MinHash technique that uses a single hash function.

(Recall that achieving estimation error epsilon requires finding approximately 1/epsilon^2 representative elements for each set. Be sure to include both the time to compute the MinHash representatives of the sets and the time to compare pairs of sets in your overall bound.)

SOLUTION: O(N + (n/epsilon)^2).