Rejection via Learning Density Ratios (2405.18686v2)
Abstract: Classification with rejection is a learning paradigm that allows models to abstain from making predictions. The predominant approach alters the supervised learning pipeline by augmenting typical loss functions so that rejection incurs a lower loss than an incorrect prediction. We instead propose a distributional perspective, where we seek an idealized data distribution that maximizes a pretrained model's performance. This can be formalized as the optimization of a loss's risk with a $\varphi$-divergence regularization term. Given this idealized distribution, a rejection decision can be made using the density ratio between it and the data distribution. We focus on the setting where the $\varphi$-divergences are drawn from the family of $\alpha$-divergences. Our framework is evaluated empirically on clean and noisy datasets.
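To make the rejection rule concrete, the sketch below instantiates the $\varphi$-divergence with the KL divergence (the $\alpha \to 1$ member of the $\alpha$-divergence family), for which the idealized distribution minimizing $\mathbb{E}_Q[\ell] + \lambda\, \mathrm{KL}(Q \,\|\, P)$ is the Gibbs distribution with density ratio $\mathrm{d}Q^*/\mathrm{d}P \propto \exp(-\ell(x)/\lambda)$. This is a minimal sketch under that assumption only; the names `rejection_scores` and `reject`, the regularization weight `lam`, and the rejection `threshold` are illustrative choices, not the paper's implementation.

```python
import numpy as np

def rejection_scores(losses, lam=1.0):
    """Empirical density-ratio scores for the KL (alpha -> 1) instance.

    For KL regularization, the idealized distribution Q* minimizing
    E_Q[loss] + lam * KL(Q || P) satisfies
        dQ*/dP(x) proportional to exp(-loss(x) / lam),
    normalized so the ratio averages to 1 under the data distribution.
    """
    ratios = np.exp(-np.asarray(losses, dtype=float) / lam)
    return ratios / ratios.mean()  # enforce E_P[dQ*/dP] = 1 empirically

def reject(losses, lam=1.0, threshold=0.5):
    """Abstain on examples whose density ratio falls below `threshold`,
    i.e. points the idealized distribution down-weights."""
    return rejection_scores(losses, lam) < threshold

# Usage: per-example losses from a pretrained classifier.
losses = np.array([0.05, 0.10, 2.30, 0.08, 1.90])
print(reject(losses))  # the two high-loss points are rejected
```

High-loss examples receive small density ratios, so thresholding the ratio recovers the intuition that the model should abstain exactly where the idealized distribution places little mass.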