Simple Weak Coresets for Non-Decomposable Classification Measures
Abstract: While coresets have seen growing application, barring a few exceptions they have mostly been limited to unsupervised settings. We consider supervised classification problems and non-decomposable evaluation measures in such settings. We show that coresets based on stratified uniform sampling have excellent empirical performance that is backed by theoretical guarantees. We focus on the F1 score and the Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are nontrivial to optimize, and show that uniform coresets attain a lower bound on coreset size and have good empirical performance, comparable with "smarter" coreset construction strategies.
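As a rough illustration of the two ingredients named in the abstract, the sketch below draws a stratified uniform sample (uniform within each class) and evaluates F1 and MCC from weighted confusion counts. It is only a minimal sketch, not the paper's construction: the proportional allocation per class, the reweighting by n_c / m_c, and all function names are assumptions made for this example. Both measures are computed from aggregate counts rather than as a sum of per-example losses, which is what makes them non-decomposable.

```python
import math
import random
from collections import defaultdict


def stratified_uniform_coreset(labels, size, seed=0):
    """Sample indices uniformly within each class (hypothetical helper).

    Assumption: each class gets a share of `size` proportional to its
    frequency, and each sampled point in class c is weighted n_c / m_c
    so that weighted class counts match the full data set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n = len(labels)
    indices, weights = [], []
    for y, idx in sorted(by_class.items()):
        # At least one point per class, never more than the class holds.
        m = min(len(idx), max(1, round(size * len(idx) / n)))
        for i in rng.sample(idx, m):
            indices.append(i)
            weights.append(len(idx) / m)
    return indices, weights


def weighted_confusion(y_true, y_pred, weights):
    """Weighted binary confusion counts (tp, fp, fn, tn); labels are 0/1."""
    tp = fp = fn = tn = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        if t == 1 and p == 1:
            tp += w
        elif t == 0 and p == 1:
            fp += w
        elif t == 1 and p == 0:
            fn += w
        else:
            tn += w
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Usage: compare a fixed classifier's scores on the full data and the sample.
labels = [1] * 100 + [0] * 900
preds = [1 if i % 3 else 0 for i in range(1000)]  # arbitrary toy predictions
idx, w = stratified_uniform_coreset(labels, size=100)
full = weighted_confusion(labels, preds, [1.0] * len(labels))
core = weighted_confusion([labels[i] for i in idx], [preds[i] for i in idx], w)
print(f1_score(*full), f1_score(*core))
print(mcc(*full), mcc(*core))
```

Sampling per class rather than over the pooled data keeps the minority class represented in the sample, which matters because F1 and MCC are both sensitive to class imbalance.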