# Streaming *k*-means on Well-Clusterable Data

## Michael Shindler

One of the central problems in data-analysis is *k*-means
clustering. We study the problem under the *streaming* variant,
in which we may read the input only once and must use only a small
amount of memory. Our result uses the assumption of *data
separability*, which closely reflects how *k*-means is used in
practice. We show a near-optimal streaming approximation algorithm for
*k*-means with sublinear memory and a single pass, under the data
separability assumption. Our algorithm offers significant improvements
in both space and running time over previous work while yielding
asymptotically best-possible performance. We also show several related
results, including a method to convert any constant approximation to be
more accurate on well-clusterable data.

(Joint work with Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky,
Alan Roytman, and Brian Tagiku, to appear at SODA 2011.)