CompSci 261, Spring 2017, Homework 4

Suppose that we wish to process a stream of pairs of keys, with each pair consisting of a left key and a right key. Each key can appear only once as a left key, and only once as a right key, but it is possible to have a "double key": a key that appears both as a left key in one pair and as a right key in one pair (not necessarily the same pair as the one for which it is the left key). Our goal is to accurately estimate the number of double keys (achieving an estimate that is accurate to within a factor $(1+\epsilon)$ of the true number, with probability at least $1-\delta$, for given parameters $\epsilon$ and $\delta$). Which of the four streaming algorithms we have discussed (invertible Bloom filters, majority-finding with multiple keys and counts, the count-min-sketch, or MinHash) would be the most suitable to use for this problem? Describe how to use your chosen algorithm to solve the problem efficiently.

Solution: MinHash. Let $L$ and $R$ be the sets of left and right keys seen so far, and maintain the MinHash sketches of $L$ and $R$. Also maintain the number $n$ of key-pairs seen so far. The sketches give an accurate approximation to the Jaccard similarity of $L$ and $R$, but the number of double keys that we want, $|L\cap R|$, is actually $n$ times the cosine similarity (because each key appears at most once on each side, $|L|=|R|=n$). So invert the answer to question (2) to convert estimates of the Jaccard similarity into estimates of the cosine similarity.
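
As a worked version of this conversion (a sketch, assuming question (2) asks for the relation between the two similarities for sets of equal size), write $J$ for the Jaccard similarity, $C$ for the cosine similarity, and $d=|L\cap R|$ for the number of double keys. Since $|L|=|R|=n$,

\[ J=\frac{d}{|L\cup R|}=\frac{d}{2n-d}, \qquad C=\frac{d}{\sqrt{|L|\,|R|}}=\frac{d}{n}. \]

Solving the first equation for $d$ gives

\[ d=\frac{2nJ}{1+J}, \qquad\text{so}\qquad C=\frac{2J}{1+J}. \]

Because the derivative of $2J/(1+J)$ is at most $2$ for $0\le J\le 1$, an additive error of $\epsilon$ in the estimate of $J$ becomes an additive error of at most $2\epsilon n$ in the estimate of $d$.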

The question didn't ask for an analysis of how efficient your solution is, and I think it mis-stated the accuracy that can be achieved. But to achieve probability at least $1-\delta$ of getting an estimate accurate to an additive error of $\pm\epsilon n$ (not a multiplicative error of $1+\epsilon$), the MinHash sketches should use sample size

\[ k=\Theta\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right). \]
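
The following Python sketch illustrates one way the solution might be implemented. It is not taken from the lecture: the names `MinHashSketch` and `estimate_double_keys` are hypothetical, the $k$ hash functions are simulated by salting BLAKE2b with the function index (an assumption; the course may have used a different MinHash variant, such as keeping the $k$ smallest values of a single hash function), and the constant hidden in the $\Theta$ bound on $k$ is left for the caller to choose. The estimate uses the identity $|L\cap R| = 2nJ/(1+J)$, which holds because $|L|=|R|=n$.

```python
import hashlib


class MinHashSketch:
    """k-function MinHash sketch: for each of k hash functions,
    remember the smallest hash value of any key added so far."""

    def __init__(self, k):
        self.k = k
        self.mins = [None] * k

    @staticmethod
    def _hash(i, key):
        # Assumption: salting BLAKE2b with the index i stands in for
        # k independent random hash functions. Keys are compared by str().
        data = f"{i}:{key}".encode()
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def add(self, key):
        for i in range(self.k):
            h = self._hash(i, key)
            if self.mins[i] is None or h < self.mins[i]:
                self.mins[i] = h


def estimate_double_keys(pairs, k):
    """Process a stream of (left, right) pairs in one pass and
    return an estimate of the number of double keys."""
    left, right = MinHashSketch(k), MinHashSketch(k)
    n = 0
    for l, r in pairs:
        left.add(l)
        right.add(r)
        n += 1
    if n == 0:
        return 0.0
    # Fraction of hash functions whose minima agree estimates the Jaccard similarity J.
    matches = sum(a == b for a, b in zip(left.mins, right.mins))
    j = matches / k
    # |L| = |R| = n, so |L ∩ R| = 2nJ / (1 + J).
    return n * 2 * j / (1 + j)
```

For example, `estimate_double_keys([(i, i + 50) for i in range(100)], k=400)` processes a stream with 50 double keys; the returned value is a randomized estimate whose error shrinks as `k` grows. Each sketch stores `k` hash values regardless of the stream length.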

It's tempting to instead try to solve this problem by using a count-min sketch for $L$ and $R$, since the number of double keys that we want is just the dot product of the 0-1 vectors representing membership in the two sets. However, because the intersection can be much smaller than the sets themselves, this method appears to require much more space than MinHash. In particular, to achieve additive error $\pm\epsilon n$ (the same one as above) with the error bounds on dot products given in the lecture, we would need a sketch of size $\Theta(n)$, which is no improvement over just storing the sets explicitly.
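
To see where the $\Theta(n)$ comes from, here is a short calculation. It assumes the bound given in the lecture is the standard one for count-min sketches: with rows of width $w=\Theta(1/\epsilon')$, the estimated dot product of two nonnegative vectors $a$ and $b$ exceeds the true value by at most $\epsilon'\,\|a\|_1\|b\|_1$ with the stated probability. For the membership vectors of $L$ and $R$ we have $\|a\|_1=\|b\|_1=n$, so

\[ \text{error}\le \epsilon' n^2, \qquad \epsilon' n^2\le \epsilon n \iff \epsilon'\le\frac{\epsilon}{n}, \qquad w=\Theta\left(\frac{1}{\epsilon'}\right)=\Theta\left(\frac{n}{\epsilon}\right). \]

For constant $\epsilon$ this is linear in $n$, whereas the MinHash sample size above does not depend on $n$ at all.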