2000 character limit reached
Fast and Accurate $k$-means++ via Rejection Sampling (2012.11891v1)
Published 22 Dec 2020 in cs.LG and cs.DS
Abstract: $k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees and strong empirical performance. Despite its wide adoption, $k$-means++ sometimes suffers from being slow on large data-sets so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present a near linear time algorithm for $k$-means++ seeding. Interestingly our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.
- Vincent Cohen-Addad (88 papers)
- Silvio Lattanzi (47 papers)
- Ashkan Norouzi-Fard (24 papers)
- Christian Sohler (27 papers)
- Ola Svensson (55 papers)