Rejection via Learning Density Ratios (2405.18686v2)

Published 29 May 2024 in stat.ML and cs.LG

Abstract: Classification with rejection is a learning paradigm that allows models to abstain from making predictions. The predominant approach alters the supervised learning pipeline by augmenting typical loss functions so that rejection incurs a lower loss than an incorrect prediction. Instead, we propose a different, distributional perspective, in which we seek an idealized data distribution that maximizes a pretrained model's performance. This can be formalized as the optimization of a loss's risk with a $\varphi$-divergence regularization term. Given this idealized distribution, a rejection decision can be made using the density ratio between it and the data distribution. We focus on the setting where the $\varphi$-divergences are specified by the family of $\alpha$-divergences. Our framework is tested empirically on clean and noisy datasets.
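
As a rough illustration of the idea above (a sketch of one plausible formalization, not necessarily the paper's exact objective), the idealized distribution $Q^{\star}$ can be viewed as the minimizer of the model's risk regularized by a $\varphi$-divergence to the data distribution $P$, with the rejection rule thresholding the induced density ratio; the loss $\ell$, model $f$, regularization weight $\lambda$, and threshold $\tau$ below are illustrative symbols rather than the paper's notation:

$$ Q^{\star} \in \operatorname*{arg\,min}_{Q} \; \mathbb{E}_{(x,y)\sim Q}\big[\ell(f(x), y)\big] + \lambda\, D_{\varphi}(Q \,\|\, P), \qquad \text{reject } x \iff \frac{\mathrm{d}Q^{\star}}{\mathrm{d}P}(x) \le \tau. $$

Intuitively, $Q^{\star}$ reweights the data toward regions where the pretrained model already performs well, so a small density ratio $\mathrm{d}Q^{\star}/\mathrm{d}P(x)$ flags inputs that the idealized distribution downweights, which are natural candidates for abstention.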
