CompSci 261, Spring 2017, Homework 4

Suppose that we wish to read a stream of numbers and maintain the average and standard deviation of the numbers. Recall that the average of $n$ numbers $x_1,x_2,\dots x_n$ is \[ \bar x = \sum_{i=1}^n \frac{x_i}{n} \] and that the (uncorrected) standard deviation is \[ \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2}. \]
1. Show that both the average and standard deviation can be calculated in constant time from the three values $n$, $p=\sum x_i$, and $q=\sum x_i^2$.
2. Use your answer to part (a) to design a streaming data structure that can process each input number in constant time, that can report the average and standard deviation in constant time, and that uses only a constant number of words of memory.
Recall that the Jaccard similarity of two sets $A$ and $B$ is \[ \frac{|A\cap B|}{|A\cup B|}. \] while the cosine similarity (of the 0-1 vectors corresponding to these sets) is \[ \frac{|A\cap B|}{\sqrt{|A|\cdot|B|}}. \] If two sets $A$ and $B$ have equal sizes and their cosine similarity is $x$, what is their Jaccard similarity?
Find an input sequence for the constant-space majority-finding algorithm that has only three different keys, each repeated a different number of times, such that the key returned by the algorithm is the one that appears the least number of times in your sequence. (For this question, you should use the version of majority-finding that maintains only a single key and its count.)
Suppose that we wish to process a stream of pairs of keys, with each pair consisting of a left key and a right key. Each key can appear only once as a left key, and only once as a right key, but it is possible to have a "double key": a key that appears both as a left key in one pair and as a right key in one pair (not necessarily the same pair as the one for which it is the left key). Our goal is to accurately estimate the number of double keys (achieving an estimate that is accurate to within a factor $(1+\epsilon)$ of the true number, with probability at least $1-\delta$, for given parameters $\epsilon$ and $\delta$). Which of the four streaming algorithms we have discussed (invertible Bloom filters, majority-finding with multiple keys and counts, the count-min-sketch, or MinHash) would be the most suitable to use for this problem? Describe how to use your chosen algorithm to solve the problem efficiently.