Theoretically Grounded Loss Functions and Algorithms for Score-Based Multi-Class Abstention (2310.14770v2)
Abstract: Learning with abstention is a key scenario where the learner can abstain from making a prediction at some cost. In this paper, we analyze the score-based formulation of learning with abstention in the multi-class classification setting. We introduce new families of surrogate losses for the abstention loss function, which include the state-of-the-art surrogate losses in the single-stage setting and a novel family of loss functions in the two-stage setting. We prove strong non-asymptotic and hypothesis set-specific consistency guarantees for these surrogate losses, which upper-bound the estimation error of the abstention loss function in terms of the estimation error of the surrogate loss. Our bounds can help compare different score-based surrogates and guide the design of novel abstention algorithms by minimizing the proposed surrogate losses. We experimentally evaluate our new algorithms on the CIFAR-10, CIFAR-100, and SVHN datasets and demonstrate the practical significance of our new surrogate losses and two-stage abstention algorithms. Our results also show that the relative performance of the state-of-the-art score-based surrogate losses can vary across datasets.
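For concreteness, here is a minimal sketch of the setting the abstract refers to, written in the standard notation of the score-based abstention literature; the paper's precise definitions and surrogate families are not restated here, so the notation below ($n$ classes, scoring function $h$, rejection cost $c$, hypothesis set $\mathcal{H}$) is an assumption on our part rather than a verbatim excerpt. With the label set augmented by an $(n+1)$-st "reject" category and prediction by highest score, the abstention loss and the generic shape of a hypothesis set-specific consistency bound read:

% Score-based abstention: rejecting incurs a fixed cost c in (0, 1);
% otherwise the usual zero-one misclassification loss applies.
\[
  \ell_{\mathrm{abs}}(h, x, y)
    \;=\; 1_{\mathsf{h}(x) \neq y}\, 1_{\mathsf{h}(x) \neq n+1}
    \;+\; c\, 1_{\mathsf{h}(x) = n+1},
  \qquad
  \mathsf{h}(x) \;=\; \operatorname*{argmax}_{y \in [n+1]} h(x, y).
\]
% H-consistency bound: the abstention estimation error is controlled by the
% surrogate estimation error through a non-decreasing (often concave) function
% Gamma, up to minimizability gaps M(H) that vanish for sufficiently rich H.
\[
  \mathcal{R}_{\ell_{\mathrm{abs}}}(h) - \mathcal{R}^{*}_{\ell_{\mathrm{abs}}}(\mathcal{H})
    + \mathcal{M}_{\ell_{\mathrm{abs}}}(\mathcal{H})
  \;\le\;
  \Gamma\!\bigl(\mathcal{R}_{\ell}(h) - \mathcal{R}^{*}_{\ell}(\mathcal{H})
    + \mathcal{M}_{\ell}(\mathcal{H})\bigr).
\]

A bound of this form is non-asymptotic and specific to $\mathcal{H}$: driving the surrogate estimation error to zero drives the abstention estimation error to zero at a rate governed by $\Gamma$, which is what makes it possible to compare different score-based surrogates, as the abstract notes.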