#
Fast and Accurate k-means for Large Datasets

##
Michael Shindler

Clustering is a popular problem with many applications. We consider
the $k$-means problem in the situation where the data is too large to
be stored in main memory and must be accessed sequentially, such as
from a disk, and where we must use as little memory as possible. Our
algorithm is based on recent theoretical results, with significant
improvements to make it practical. Our approach greatly simplifies a
recently developed algorithm, both in design and in analysis, and
eliminates large constant factors in both the approximation guarantee
and the memory requirements. We then incorporate approximate nearest
neighbor search to compute $k$-means in $o(nk)$ (whereas computing
the cost, given a solution, takes $\Theta(nk)$ time). We show that
our algorithm compares favorably to existing algorithms - both
theoretically and experimentally.