Top-$k$ Classification and Cardinality-Aware Prediction (2403.19625v1)
Abstract: We present a detailed study of top-$k$ classification, the task of predicting the $k$ most probable classes for an input, extending beyond single-class prediction. We demonstrate that several prevalent surrogate loss functions in multi-class classification, such as comp-sum and constrained losses, are supported by $H$-consistency bounds with respect to the top-$k$ loss. These bounds guarantee consistency relative to the hypothesis set $H$ and provide stronger guarantees than Bayes-consistency, since they are non-asymptotic and hypothesis-set-specific. To address the trade-off between accuracy and cardinality $k$, we further introduce cardinality-aware loss functions through instance-dependent cost-sensitive learning. For these functions, we derive cost-sensitive comp-sum and constrained surrogate losses, and establish their $H$-consistency bounds and Bayes-consistency. Minimizing these losses leads to new cardinality-aware algorithms for top-$k$ classification. We report the results of extensive experiments on the CIFAR-100, ImageNet, CIFAR-10, and SVHN datasets, demonstrating the effectiveness and benefit of these algorithms.
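The abstract describes these losses only at a high level. As a minimal sketch of the quantities involved, not the authors' exact formulation, the Python snippet below implements the top-$k$ zero-one loss, a comp-sum-style surrogate (the standard logistic/cross-entropy loss, the simplest member of that family), and a hypothetical cardinality-aware objective in which a cost `cost(k)` penalizes larger prediction sets. All function names and the cost model are illustrative assumptions.

```python
import numpy as np

def top_k_loss(scores, y, k):
    """Top-k zero-one loss: 1 if the true label y is not among the k
    highest-scoring classes, 0 otherwise."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(y not in top_k)

def logistic_comp_sum(scores, y):
    """Logistic (cross-entropy) loss, the simplest comp-sum surrogate:
    log(sum_j exp(s_j - s_y)) = -log softmax(scores)[y]."""
    return float(np.log(np.sum(np.exp(scores - scores[y]))))

def cardinality_aware_objective(scores, y, candidate_ks, cost):
    """Hypothetical cardinality-aware target: for each candidate k, pay
    the top-k error plus cost(k) for predicting a set of size k (here a
    fixed cost for simplicity; the paper uses instance-dependent costs).
    A learned selector would aim to pick the best k per instance."""
    return min(top_k_loss(scores, y, k) + cost(k) for k in candidate_ks)

# Illustrative usage with made-up scores and a cost linear in k.
scores = np.array([2.0, 1.0, 0.5, -1.0])
print(top_k_loss(scores, y=1, k=2))        # 0.0: label 1 is in the top 2
print(logistic_comp_sum(scores, y=1))      # smooth convex surrogate value
print(cardinality_aware_objective(scores, y=3,
                                  candidate_ks=[1, 2, 4],
                                  cost=lambda k: 0.1 * k))
```

In the paper's setting, the cost is instance-dependent and the surrogates are cost-sensitive variants of the comp-sum and constrained losses; this sketch only mirrors the accuracy-versus-cardinality trade-off those algorithms target.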