
Settling Time vs. Accuracy Tradeoffs for Clustering Big Data (2404.01936v1)

Published 2 Apr 2024 in cs.LG and cs.DS

Abstract: We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments.
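
The near-linear-time result the abstract describes rests on sensitivity sampling: points that contribute more to the clustering cost are retained with higher probability, and each sampled point is reweighted so that weighted cost estimates remain unbiased. The following is a minimal illustrative sketch of that sampling pattern, not the paper's algorithm; the crude sensitivity upper bound (cost share plus a uniform term) and the k-means++ seeding step are assumptions made here for demonstration:

```python
import numpy as np

def kmeans_pp_centers(X, k, rng):
    """k-means++ seeding: first center uniform, later centers sampled
    proportionally to squared distance from the current center set."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.asarray(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)

def sensitivity_coreset(X, k, m, seed=0):
    """Sample a weighted coreset of m points via a rough sensitivity bound:
    each point's share of the clustering cost plus a uniform 1/n term.
    Weights 1/(m * p_i) keep weighted cost estimates unbiased."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = kmeans_pp_centers(X, k, rng)
    # cost of each point = squared distance to its nearest seeded center
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1).min(axis=1)
    sens = d2 / max(d2.sum(), 1e-12) + 1.0 / n   # crude sensitivity upper bound
    p = sens / sens.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    return X[idx], 1.0 / (m * p[idx])            # coreset points, weights
```

The coreset guarantee is that the weighted cost on the sample approximates the full-data cost for any candidate set of centers; uniform sampling drops the sensitivity term and is faster but gives no such accuracy guarantee, which is exactly the tradeoff the paper studies.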
