CS 261, Spring 2013, Homework 3 Solutions 1. Suppose you have two Bloom filters FA and FB (each having the same number of cells and the same hash functions) representing the two sets A and B. Let FC = FA & FB be the Bloom filter formed by computing the bitwise Boolean and of FA and FB. (a) FC may not always be the same as the Bloom filter that would be constructed by adding the elements of the set (A intersect B) one at a time. Explain why not. Because an element of A \ B (the elements that belong to A but not B) and another element of B \ A may both map to the same cell of the Bloom filter; if they do, that cell will belong to FC even if it is not covered by any element of the set intersection. For instance, if A and B are the single-element sets {a} and {b}, and some cells are covered by both a and b, then those cells will be nonzero in FC even though the correct Bloom filter for (A intersect B) has all cells zero. (b) Does FC correctly represent the set (A intersect B), in the sense that it gives a positive answer for membership queries of all elements in this set? Explain why or why not. Yes. Every element in (A intersect B) will have all its cells nonzero in both FA and FB, so they will all be nonzero in FC. 2. Suppose that we want to store a set S of n = 20 elements, drawn from a universe of U = 10000 possible keys, in a Bloom filter of exactly N = 100 cells, and that we care only about the accuracy of the Bloom filter and not its speed. For this problem size, what is the best choice of the number of cells per key (the parameter k in the lecture)? (I.e., what value of k gives the smallest possible probability that a key not in S is a false positive?) What is the probability of a false positive for this choice of k? I think the easiest way to solve this problem is computationally. The probability that a cell is missed is (1-1/100)^{20k}, so the probability that it is hit is (1 - (1-1/100)^{20k}) and the probability of a false positive is (1 - (1-1/100)^{20k})^k. The Python expression min(((1-(1-1./100)**(20*k))**k,k) for k in range(1,30)) gives the answer (0.09286327705662296, 3): that is, setting k=3 gives false positive probability approximately 0.0929. (k=4 is worse, but only slightly worse: its false positive probability is approximately 0.0932.) For values of k greater than or equal to 30, the probability that there exists a missed cell is at most 100 (1-1/100)^{20k} < 0.25, and if there is no missed cell then there is definitely a false positive, so the probability of a false positive is at least 0.75, clearly not the minimum possible. So it's safe to terminate the computational search for the best k at k=30. It would also have been ok to derive the answer mathematically rather than computationally, but doing this exactly (rather than resorting to approximations) seems more difficult than the computational approach. 3. Recall that the MinHash sketch stores the k elements of a set that have the smallest hash values (for a parameter k that determines the accuracy of the sketch, and for a fixed hash function that is used for all such sketches). Suppose that we wish to compute the MinHash sketch for a set S of n elements, where n is so big that S does not fit into main memory, and can only be accessed as a data stream. Design an algorithm for computing the sketch using only an amount of storage sufficient to hold O(k) set elements or their hash values, in a single pass over the data, in total time O(n). You may assume that each evaluation of the hash function takes time O(1). You may also find it helpful to recall that the xth largest element of a set of y values can be computed in time O(y) (this is called "linear time selection"). You should not assume that k is a constant that can be omitted from O-notation. The intended answer is to keep a buffer of at most 2k elements that is guaranteed to contain the k smallest elements seen so far. For each element in the data stream, we add it to the buffer, and then when the buffer size reaches exactly 2k we cut it back down to k by doing a single linear time selection computation. There are O(n/k) selection steps, each taking time O(k), so the total time is O(n). When we reach the end of the stream, we need one more selection step to find the k smallest keys overall. A Java implementation of a randomized variation of this answer can be found at http://www.ics.uci.edu/~eppstein/161/KBest.java Probably there are other ways of achieving the same bounds. Answers that use a priority queue are generally not good enough (they take time O(n log k) rather than O(n)). Applying the linear time selection algorithm to the whole data set is also not good enough: it takes O(n) time but also uses O(n) space.