CS 261, Spring 2023, Practice Problem Set 4

  1. Suppose that we want to maintain a random sample of a cash-register-model data stream whose size depends on the number of stream elements seen so far: after seeing \(n\) elements, we should have a sample of size \(\lceil\log_2 n\rceil\), chosen uniformly at random among all possible samples of that size. However, we have no advance knowledge of how big \(n\) is going to be. Explain why, in this scenario, it is not possible to maintain a sketch that allows us to generate samples of the desired size (with the exact probability distribution specified above) using a sublinear amount of space.

    (Hint: Find a larger \(N\) for which there is a nonzero probability that the values of all \(n\) items in the current stream will need to be remembered later, after \(N\) elements have been seen.)

  2. For the hierarchical deterministic \(\varepsilon\)-approximation method, how many unmerged blocks would exist after exactly 1000 elements have been added? (Hint: use binary. You do not have to calculate the lengths of the approximations stored for the blocks.)

  3. Suppose we have two Boyer--Moore majority sketches \((m_A,c_A)\) and \((m_B,c_B)\) for input sequences \(A\) and \(B\). Explain how to use them to compute a single sketch for the concatenation of sequences \(AB\), again consisting of a pair of numbers \((m,c)\). Your combined sketch does not have to be equal to the sketch that the majority algorithm would produce for \(AB\), but it should provide an estimate for the number \(\operatorname{count}(x)\) of occurrences of each element \(x\) that is bounded between \(\operatorname{count}(x)-|AB|/2\) and \(\operatorname{count}(x)\), just like the majority algorithm would. (In particular, this implies that if \(AB\) has a majority element, your combination will choose that element as its value of \(m\)). Explain why your combination has this property.

  4. The lecture notes for Thursday include the invertible Bloom filter, a somewhat complicated data structure that can report up to \(k\) stragglers (elements that were inserted and then never removed) in a turnstile data stream, for a given parameter \(k\). Describe how to solve the special case of this problem for \(k=1\) using one of the streaming data structures from the lecture notes for Tuesday.