CS 261, Spring 2013, Homework 3 Solutions


1. Suppose you have two Bloom filters FA and FB (each having the same
number of cells and the same hash functions) representing the two sets A
and B. Let FC = FA & FB be the Bloom filter formed by computing the
bitwise Boolean and of FA and FB.

(a) FC may not always be the same as the Bloom filter that would be
constructed by adding the elements of the set (A intersect B) one at a
time. Explain why not.

    Because an element of A \ B (the elements that belong to A but not
    B) and another element of B \ A may both map to the same cell of the
    Bloom filter; if they do, that cell will belong to FC even if it is
    not covered by any element of the set intersection. For instance, if
    A and B are the single-element sets {a} and {b}, and some cells are
    covered by both a and b, then those cells will be nonzero in FC even
    though the correct Bloom filter for (A intersect B) has all cells
    zero.

(b) Does FC correctly represent the set (A intersect B), in the sense
that it gives a positive answer for membership queries of all elements
in this set? Explain why or why not.

    Yes. Every element in (A intersect B) will have all its cells nonzero
    in both FA and FB, so they will all be nonzero in FC.


2. Suppose that we want to store a set S of n = 20 elements, drawn from
a universe of U = 10000 possible keys, in a Bloom filter of exactly N =
100 cells, and that we care only about the accuracy of the Bloom filter
and not its speed. For this problem size, what is the best choice of the
number of cells per key (the parameter k in the lecture)? (I.e., what
value of k gives the smallest possible probability that a key not in S
is a false positive?) What is the probability of a false positive for
this choice of k?

   I think the easiest way to solve this problem is computationally.
   The probability that a cell is missed is (1-1/100)^{20k},
   so the probability that it is hit is (1 - (1-1/100)^{20k}) and
   the probability of a false positive is (1 - (1-1/100)^{20k})^k.
   The Python expression
   
   min(((1-(1-1./100)**(20*k))**k,k) for k in range(1,30))

   gives the answer (0.09286327705662296, 3): that is, setting k=3 gives
   false positive probability approximately 0.0929. (k=4 is worse, but
   only slightly worse: its false positive probability is approximately
   0.0932.)

   For values of k greater than or equal to 30, the probability that
   there exists a missed cell is at most 100 (1-1/100)^{20k} < 0.25, and
   if there is no missed cell then there is definitely a false positive,
   so the probability of a false positive is at least 0.75, clearly not
   the minimum possible. So it's safe to terminate the computational
   search for the best k at k=30.

   It would also have been ok to derive the answer mathematically rather
   than computationally, but doing this exactly (rather than resorting
   to approximations) seems more difficult than the computational
   approach.


3. Recall that the MinHash sketch stores the k elements of a set that
have the smallest hash values (for a parameter k that determines the
accuracy of the sketch, and for a fixed hash function that is used for
all such sketches). Suppose that we wish to compute the MinHash sketch
for a set S of n elements, where n is so big that S does not fit into
main memory, and can only be accessed as a data stream. Design an
algorithm for computing the sketch using only an amount of storage
sufficient to hold O(k) set elements or their hash values, in a single
pass over the data, in total time O(n).

You may assume that each evaluation of the hash function takes time
O(1). You may also find it helpful to recall that the xth largest
element of a set of y values can be computed in time O(y) (this is
called "linear time selection"). You should not assume that k is a
constant that can be omitted from O-notation.

   The intended answer is to keep a buffer of at most 2k elements
   that is guaranteed to contain the k smallest elements seen so far.
   For each element in the data stream, we add it to the buffer, and
   then when the buffer size reaches exactly 2k we cut it back down
   to k by doing a single linear time selection computation. There
   are O(n/k) selection steps, each taking time O(k), so the total
   time is O(n). When we reach the end of the stream, we need one
   more selection step to find the k smallest keys overall.

   A Java implementation of a randomized variation of this answer
   can be found at http://www.ics.uci.edu/~eppstein/161/KBest.java
   
   Probably there are other ways of achieving the same bounds.
   Answers that use a priority queue are generally not good enough
   (they take time O(n log k) rather than O(n)). Applying the
   linear time selection algorithm to the whole data set is
   also not good enough: it takes O(n) time but also uses O(n) space.