One of the central problems in data-analysis is k-means clustering. We study the problem under the streaming variant, in which we may read the input only once and must use only a small amount of memory. Our result uses the assumption of data separability, which closely reflects how k-means is used in practice. We show a near-optimal streaming approximation algorithm for k-means with sublinear memory and a single pass, under the data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance. We also show several related results, including a method to convert any constant approximation to be more accurate on well-clusterable data.
(Joint work with Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, and Brian Tagiku, to appear at SODA 2011.)