Cross-Entropy Loss Functions: Theoretical Analysis and Applications (2304.07288v2)

Published 14 Apr 2023 in cs.LG and stat.ML

Abstract: Cross-entropy is a widely used loss function in applications. It coincides with the logistic loss applied to the outputs of a neural network, when the softmax is used. But, what guarantees can we rely on when using cross-entropy as a surrogate loss? We present a theoretical analysis of a broad family of loss functions, comp-sum losses, that includes cross-entropy (or logistic loss), generalized cross-entropy, the mean absolute error and other cross-entropy-like loss functions. We give the first $H$-consistency bounds for these loss functions. These are non-asymptotic guarantees that upper bound the zero-one loss estimation error in terms of the estimation error of a surrogate loss, for the specific hypothesis set $H$ used. We further show that our bounds are tight. These bounds depend on quantities called minimizability gaps. To make them more explicit, we give a specific analysis of these gaps for comp-sum losses. We also introduce a new family of loss functions, smooth adversarial comp-sum losses, that are derived from their comp-sum counterparts by adding in a related smooth term. We show that these loss functions are beneficial in the adversarial setting by proving that they admit $H$-consistency bounds. This leads to new adversarial robustness algorithms that consist of minimizing a regularized smooth adversarial comp-sum loss. While our main purpose is a theoretical analysis, we also present an extensive empirical analysis comparing comp-sum losses. We further report the results of a series of experiments demonstrating that our adversarial robustness algorithms outperform the current state-of-the-art, while also achieving a superior non-adversarial accuracy.


Summary

  • The paper derives the first H-consistency bounds for comp-sum losses, offering precise non-asymptotic guarantees for approximating the zero-one classification loss.
  • It introduces a novel structural formulation using concave functions and score differences to analyze minimizability gaps in surrogate losses.
  • Empirical evaluations on CIFAR datasets and the development of smooth adversarial losses underscore the practical impact and robustness of the theoretical findings.

A Theoretical Analysis of Cross-Entropy and Related Loss Functions

The paper presents a comprehensive theoretical analysis of cross-entropy and related loss functions under the broader family of comp-sum losses, which includes widely used losses such as the logistic loss, generalized cross-entropy, and the mean absolute error. The objective is to derive precise, non-asymptotic guarantees, termed H-consistency bounds, which express how closely minimizing these surrogate losses approximates minimizing the zero-one classification loss.
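As a concrete illustration, the members of this family can be computed directly from a network's scores. The sketch below uses standard textbook definitions (the generalized cross-entropy form with exponent q recovers the logistic loss as q → 0 and the mean absolute error at q = 1); it is an illustration, not code from the paper:

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def logistic_loss(scores, y):
    """Cross-entropy / logistic loss: -log p_y."""
    return -math.log(softmax(scores)[y])

def generalized_cross_entropy(scores, y, q=0.7):
    """GCE: (1 - p_y^q) / q; tends to the logistic loss as q -> 0
    and equals the mean absolute error at q = 1."""
    p_y = softmax(scores)[y]
    return (1.0 - p_y ** q) / q

def mean_absolute_error(scores, y):
    """MAE applied to softmax outputs: 1 - p_y (up to a constant factor)."""
    return 1.0 - softmax(scores)[y]

scores, y = [2.0, 0.5, -1.0], 0
print(f"logistic:    {logistic_loss(scores, y):.4f}")
print(f"GCE (q=0.7): {generalized_cross_entropy(scores, y):.4f}")
print(f"MAE:         {mean_absolute_error(scores, y):.4f}")
```

Varying q interpolates between the logistic loss's strong penalties on confident mistakes and the MAE's bounded, noise-tolerant behavior.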

The principal contribution of this work is the derivation of the first H-consistency bounds for comp-sum losses, extending the theoretical understanding beyond the commonly cited Bayes consistency. These bounds rely on a novel analysis of minimizability gaps, defined as the difference between the best-in-class expected loss and the expected pointwise infimum of the surrogate loss. The authors prove that these bounds are not only tight but also hypothesis-set-specific, offering a fine-grained view of how surrogate minimization aligns with classification loss minimization for practical hypothesis sets.
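Schematically, an H-consistency bound and the minimizability gap it involves take the following form (notation adapted for illustration; the paper's statements are more precise and specify the concave majorant Γ for each comp-sum loss):

```latex
% Schematic H-consistency bound for a surrogate loss \ell and hypothesis set H:
\mathcal{E}_{\ell_{0\text{-}1}}(h) - \mathcal{E}^{*}_{\ell_{0\text{-}1}}(\mathcal{H})
  \le \Gamma\Bigl( \mathcal{E}_{\ell}(h) - \mathcal{E}^{*}_{\ell}(\mathcal{H})
      + \mathcal{M}_{\ell}(\mathcal{H}) \Bigr)
      - \mathcal{M}_{\ell_{0\text{-}1}}(\mathcal{H})

% where the minimizability gap measures how far the best-in-class expected
% loss lies above the expected pointwise infimum of the surrogate:
\mathcal{M}_{\ell}(\mathcal{H}) = \mathcal{E}^{*}_{\ell}(\mathcal{H})
  - \mathbb{E}_{x}\Bigl[ \inf_{h \in \mathcal{H}}
      \mathbb{E}_{y \mid x}\bigl[ \ell(h, x, y) \bigr] \Bigr]
```

When H is the family of all measurable functions the gaps vanish and Bayes consistency is recovered; for restricted hypothesis sets the gaps quantify the extra price paid, which is why making them explicit matters.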

A significant portion of the paper is dedicated to the methodological derivation of these bounds, facilitated by the introduction of the comp-sum loss family. Comp-sum losses are characterized through compositions of concave functions, such as the logarithm, with sums of exponentiated score differences. This structural formulation allows the theoretical findings to apply broadly to various cross-entropy-like loss functions.
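A minimal sketch of this composition, assuming the common form in which the loss is a concave transform of the sum of exponentiated score differences; choosing the transform log(1 + u) recovers the logistic (cross-entropy) loss:

```python
import math

def comp_sum_loss(phi, scores, y):
    """Generic comp-sum loss (illustrative form): a concave transform phi
    applied to the sum of exponentiated score differences against class y."""
    s = sum(math.exp(scores[yp] - scores[y])
            for yp in range(len(scores)) if yp != y)
    return phi(s)

# phi(u) = log(1 + u) yields the logistic loss, since
# log(1 + sum_{y'!=y} e^{h_{y'} - h_y}) = -log softmax_y.
logistic_phi = lambda u: math.log1p(u)

scores, y = [2.0, 0.5, -1.0], 0
loss = comp_sum_loss(logistic_phi, scores, y)

# Cross-check against the direct -log softmax computation.
z = sum(math.exp(s) for s in scores)
direct = -math.log(math.exp(scores[y]) / z)
print(abs(loss - direct) < 1e-12)  # the two forms agree
```

Swapping in a different concave transform for `logistic_phi` gives other members of the family, which is what lets one analysis cover all of them at once.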

The empirical analysis demonstrates the practical implications of the theoretical findings. The experiments focus on comparing comp-sum losses across several tasks and include evaluation datasets like CIFAR-10 and CIFAR-100. Results from these experiments underscore the tightness of the derived bounds and validate the theoretical predictions. For instance, the logistic loss, which is a special case of comp-sum losses, is shown to offer superior performance, consistent with its favorable theoretical properties outlined by the H-consistency bounds.

A significant extension is introduced in the context of adversarial robustness through the definition and analysis of smooth adversarial comp-sum losses. These are regularized versions of the comp-sum losses tailored to enhance adversarial robustness by incorporating smooth terms. The paper provides convincing theoretical arguments for employing these losses in adversarial settings by demonstrating their H-consistency bounds, thereby proposing robust algorithms that generalize well even under adversarial perturbations.
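The resulting algorithmic recipe, minimizing a regularized loss evaluated on adversarially perturbed inputs, can be sketched as follows. This is an illustrative stand-in (a one-step FGSM-style perturbation and a squared-norm regularizer on a linear model), not the paper's exact smooth adversarial comp-sum loss:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    z = sum(e)
    return [a / z for a in e]

def scores(W, x):
    """Scores of a linear model: one row of weights per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def logistic_loss(W, x, y):
    return -math.log(softmax(scores(W, x))[y])

def fgsm_perturb(W, x, y, eps):
    """One-step sign perturbation of the input; for a linear model the
    input gradient of the logistic loss is W^T (p - onehot_y)."""
    p = softmax(scores(W, x))
    p[y] -= 1.0
    grad = [sum(W[k][j] * p[k] for k in range(len(W)))
            for j in range(len(x))]
    return [x_j + eps * math.copysign(1.0, g) for x_j, g in zip(x, grad)]

def smooth_adversarial_objective(W, x, y, eps=0.1, beta=0.01):
    """Illustrative regularized objective: comp-sum (logistic) loss on a
    perturbed input plus a smooth regularization term."""
    x_adv = fgsm_perturb(W, x, y, eps)
    reg = beta * sum(w * w for row in W for w in row)
    return logistic_loss(W, x_adv, y) + reg

W = [[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]]
x, y = [0.5, -1.0], 0
print(f"clean loss: {logistic_loss(W, x, y):.4f}")
print(f"adversarial objective: {smooth_adversarial_objective(W, x, y):.4f}")
```

In practice, the inner perturbation would use a stronger multi-step attack and the smooth term would be the one prescribed by the paper; the sketch only shows the overall shape of the minimization problem.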

The implications of these findings extend towards a better understanding of the practical usability and theoretical backing for surrogate losses in classification tasks. Moreover, the theoretical framework and methodologies applied within the paper could serve as a foundation for future exploration of more complex loss structures and their roles in machine learning model optimization, particularly in robust and adversarial learning contexts.

While the presented work enhances our theoretical toolkit for analyzing classification losses, it also prompts further research questions. Future work might explore other forms of comp-sum losses, assess their empirical effectiveness across diverse datasets, or construct novel loss functions that exert finer control over the minimizability gaps through specific structural innovations. The exploration of non-complete hypothesis sets and distributional assumptions remains a promising avenue for extending the current theory. Overall, this work elegantly bridges the gap between theoretical consistency guarantees and practical performance outcomes, enriching both the academic discourse and applied methodologies in machine learning classifier training.