
From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection (2402.14332v2)

Published 22 Feb 2024 in cs.LG and stat.ML

Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
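The three-step procedure in the abstract (subsample, evaluate candidates, pick the winner) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's code: the function names, the `oracle` callable that returns a ground-truth label for a point index, and the accuracy measure (fraction of points labeled correctly under the best matching of cluster labels). The two candidates are plain single-linkage and the unsmoothed Gonzalez farthest-first heuristic; k-means++ is omitted for brevity. The sketch does not check the conditions under which the paper's guarantees hold.

```python
import itertools
import math
import random


def gonzalez_k_centers(points, k):
    """Farthest-first traversal; returns one cluster label per point."""
    centers = [0]  # assumption: start from the first point (unsmoothed variant)
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    # Assign every point to its nearest chosen center.
    return [min(range(k), key=lambda c: math.dist(p, points[centers[c]]))
            for p in points]


def single_linkage(points, k):
    """Merge the closest pairs (Kruskal-style) until k clusters remain."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    clusters = n
    for _, i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def accuracy(pred, truth, k):
    """Fraction of correct labels under the best relabeling (brute force, small k)."""
    best = 0
    for perm in itertools.permutations(range(k)):
        best = max(best, sum(perm[p] == t for p, t in zip(pred, truth)))
    return best / len(truth)


def select_on_subsample(points, oracle, candidates, k, fraction, seed=0):
    """Score each candidate on a uniform subsample and return the best one."""
    rng = random.Random(seed)
    m = max(k, int(fraction * len(points)))
    idx = rng.sample(range(len(points)), m)          # step 1: subsample
    sub = [points[i] for i in idx]
    truth = [oracle(i) for i in idx]                 # only m oracle queries
    scores = {name: accuracy(alg(sub, k), truth, k)  # step 2: evaluate
              for name, alg in candidates.items()}
    return max(scores, key=scores.get), scores       # step 3: select


# Hypothetical toy data: two well-separated 4x4 grids of points.
demo_points = ([(i * 0.1, j * 0.1) for i in range(4) for j in range(4)]
               + [(10 + i * 0.1, 10 + j * 0.1) for i in range(4) for j in range(4)])
demo_truth = [0] * 16 + [1] * 16
demo_candidates = {"single_linkage": single_linkage, "gonzalez": gonzalez_k_centers}
print(select_on_subsample(demo_points, demo_truth.__getitem__,
                          demo_candidates, k=2, fraction=0.25))
```

The oracle is queried only on the sampled indices, which reflects the setting in which ground-truth access is expensive. The brute-force label matching in `accuracy` is practical only for small `k`; a larger `k` would call for an assignment solver instead.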

