
An Analysis of $D^\alpha$ seeding for $k$-means (2310.13474v1)

Published 20 Oct 2023 in cs.DS and cs.LG

Abstract: One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $$ O_\alpha \left((g_\alpha)^{2/\alpha}\cdot \left(\frac{\sigma_{\mathrm{max}}}{\sigma_{\mathrm{min}}}\right)^{2-4/\alpha}\cdot (\min\{\ell,\log k\})^{2/\alpha}\right)$$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum standard deviations of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$ is tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
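For context, $D^\alpha$ seeding generalizes $k$-means++: the first center is drawn uniformly at random, and each subsequent center is sampled with probability proportional to the $\alpha$-th power of its distance to the nearest center chosen so far ($\alpha=2$ recovers the classical $D^2$ rule). The following is a minimal sketch of that sampling rule, assuming the data are rows of a NumPy array; the function names and the NumPy-based implementation are illustrative and not taken from the paper.

```python
import numpy as np


def d_alpha_seeding(X, k, alpha=2.0, seed=None):
    """D^alpha seeding: pick the first center uniformly at random, then pick
    each new center with probability proportional to dist^alpha to the nearest
    center chosen so far (alpha = 2 corresponds to k-means++)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    # Squared Euclidean distance of every point to its closest center so far.
    closest_sq = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        weights = closest_sq ** (alpha / 2.0)      # D^alpha sampling weights
        probs = weights / weights.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        closest_sq = np.minimum(closest_sq, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)


def k_means_cost(X, centers):
    """Standard (k,2)-means cost: sum of squared distances to the nearest center."""
    d_sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d_sq.min(axis=1).sum()
```

An experiment in the spirit of the paper would compare the resulting `k_means_cost` for seeds drawn with $\alpha=2$ versus $\alpha>2$, optionally followed by Lloyd's algorithm (e.g., passing the seeded centers as the initialization of a standard $k$-means routine).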

References (30)
  1. Clusterability: A theoretical study. In Artificial Intelligence and Statistics, pages 1–8. PMLR, 2009.
  2. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques: 12th International Workshop, APPROX 2009, and 13th International Workshop, RANDOM 2009, Berkeley, CA, USA, August 21-23, 2009. Proceedings, pages 15–28. Springer, 2009.
  3. The application of unsupervised clustering methods to Alzheimer's disease. Frontiers in Computational Neuroscience, 13:31, 2019.
  4. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
  5. How slow is the k-means method? In Proceedings of the twenty-second annual symposium on Computational geometry, pages 144–153, 2006.
  6. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
  7. Emil Artin. The gamma function. Courier Dover Publications, 2015.
  8. Distributed and provably good seedings for k-means in constant rounds. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 292–300. PMLR, 2017.
  9. Scalable k-means++. arXiv preprint arXiv:1203.6402, 2012.
  10. Data-driven clustering via parameterized Lloyd's families. Advances in Neural Information Processing Systems, 31, 2018.
  11. K-means cluster analysis for image segmentation. International Journal of Computer Applications, 96(4), 2014.
  12. An intelligent market segmentation system using k-means and particle swarm optimization. Expert Systems with Applications, 36(3):4558–4565, 2009.
  13. k-means++: few more steps yield constant approximation. In International Conference on Machine Learning, pages 1909–1917. PMLR, 2020.
  14. Fast and accurate $k$-means++ via rejection sampling. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  15. Sanjoy Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE, 1999.
  16. Sanjoy Dasgupta. Algorithms for k-means clustering, 2013.
  17. k-means advantages and disadvantages. https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages?hl=en. Accessed: 2023-10-06.
  18. A nearly tight analysis of greedy k-means++. In Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 22-25, 2023, pages 1012–1070. SIAM, 2023.
  19. Inequalities. Cambridge University Press, 1952.
  20. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662–3671. PMLR, 2019.
  21. The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation, pages 274–285. Springer, 2009.
  22. Improved guarantees for k-means++ and k-means++ parallel. Advances in Neural Information Processing Systems, 33:16142–16152, 2020.
  23. Traffic anomaly detection using k-means clustering. In GI/ITG Workshop MMBnet, volume 7, 2007.
  24. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22, 2013.
  25. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  26. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
  27. Roman Vershynin. High-dimensional probability. University of California, Irvine, 2020.
  28. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. Advances in neural information processing systems, 29, 2016.
  29. Standardized moment. https://en.wikipedia.org/wiki/Standardized_moment. Accessed: 2023-10-07.
  30. Student's t-distribution. https://en.wikipedia.org/wiki/Student%27s_t-distribution. Accessed: 2023-10-07.
