# CompSci 261, Spring 2017, Homework 4

1. Suppose that we wish to read a stream of numbers and maintain the average and standard deviation of the numbers. Recall that the average of $n$ numbers $x_1,x_2,\dots x_n$ is $\bar x = \sum_{i=1}^n \frac{x_i}{n}$ and that the (uncorrected) standard deviation is $\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2}.$

1. Show that both the average and standard deviation can be calculated in constant time from the three values $n$, $p=\sum x_i$, and $q=\sum x_i^2$.

Solution: $\bar x=\frac{p}{n}$ and $\sigma=\sqrt{\frac{q - 2p\bar x + n\bar x^2}{n}}$
2. Use your answer to part (a) to design a streaming data structure that can process each input number in constant time, that can report the average and standard deviation in constant time, and that uses only a constant number of words of memory.

Solution: Maintain $n$, $p$, and $q$ for the values seen so far in the stream. For each new value $x_i$, increase $n$, $p$, and $q$ by (respectively) $1$, $x_i$, and $x_i^2$. To report the average and standard deviation, use the formulas from part (a).
2. Recall that the Jaccard similarity of two sets $A$ and $B$ is $\frac{|A\cap B|}{|A\cup B|}.$ while the cosine similarity (of the 0-1 vectors corresponding to these sets) is $\frac{|A\cap B|}{\sqrt{|A|\cdot|B|}}.$ If two sets $A$ and $B$ have equal sizes and their cosine similarity is $x$, what is their Jaccard similarity?

Solution: $\frac{x}{2-x}.$
3. Find an input sequence for the constant-space majority-finding algorithm that has only three different keys, each repeated a different number of times, such that the key returned by the algorithm is the one that appears the least number of times in your sequence. (For this question, you should use the version of majority-finding that maintains only a single key and its count.)

Solution: AAAABBBCC. The first four of these increase A's count to 4, and the next four decrease the count back down to 0, so that the final C gets stored with a count of 1 and returned as the result of the algorithm.
4. Suppose that we wish to process a stream of pairs of keys, with each pair consisting of a left key and a right key. Each key can appear only once as a left key, and only once as a right key, but it is possible to have a "double key": a key that appears both as a left key in one pair and as a right key in one pair (not necessarily the same pair as the one for which it is the left key). Our goal is to accurately estimate the number of double keys (achieving an estimate that is accurate to within a factor $(1+\epsilon)$ of the true number, with probability at least $1-\delta$, for given parameters $\epsilon$ and $\delta$). Which of the four streaming algorithms we have discussed (invertible Bloom filters, majority-finding with multiple keys and counts, the count-min-sketch, or MinHash) would be the most suitable to use for this problem? Describe how to use your chosen algorithm to solve the problem efficiently.

Solution: MinHash. Let $L$ and $R$ be the sets of left and right keys seen so far, and maintain the MinHash sketches of $L$ and $R$. Also maintain the number $n$ of key-pairs seen so far. This gives an accurate approximation to the Jaccard similarity, but the number of double keys that we want is actually $n$ times the cosine similarity, so invert the answer to question (2) to convert estimates of the Jaccard similarity into estimates of the cosine similarity.

The question didn't ask for the analysis of how efficient your solution is, and I think mis-stated the accuracy that could be achieved. But to achieve probability at least $1-\delta$ of getting an estimate accurate to an additive error of $\pm\epsilon n$ (not a multiplicative error of $1+\epsilon$) the MinHash sketches should use sample size

$k=\Theta\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right).$

It's tempting to instead try to solve this problem by using a count-min sketch for $L$ and $R$, since the number of double keys that we want is just the dot product of 0-1 vectors representing membership in the two sets. However, because the intersection can be much smaller than the sets themselves, this method appears to require much more space than MinHash. In particular to achieve additive error $\pm\epsilon n$ (the same one as above), with the error bounds on dot products given in the lecture, we would need a sketch of size $\Theta(n)$, not an improvement over just storing the sets explicitly.