Cross-Entropy Loss Functions: Theoretical Analysis and Applications (2304.07288v2)
Abstract: Cross-entropy is a widely used loss function in applications. It coincides with the logistic loss applied to the outputs of a neural network when the softmax is used. But what guarantees can we rely on when using cross-entropy as a surrogate loss? We present a theoretical analysis of a broad family of loss functions, comp-sum losses, that includes cross-entropy (or logistic loss), generalized cross-entropy, the mean absolute error, and other cross-entropy-like loss functions. We give the first $H$-consistency bounds for these loss functions. These are non-asymptotic guarantees that upper bound the zero-one loss estimation error in terms of the estimation error of a surrogate loss, for the specific hypothesis set $H$ used. We further show that our bounds are tight. These bounds depend on quantities called minimizability gaps; to make them more explicit, we give a specific analysis of these gaps for comp-sum losses. We also introduce a new family of loss functions, smooth adversarial comp-sum losses, derived from their comp-sum counterparts by adding a related smooth term. We show that these loss functions are beneficial in the adversarial setting by proving that they admit $H$-consistency bounds. This leads to new adversarial robustness algorithms that consist of minimizing a regularized smooth adversarial comp-sum loss. While our main purpose is a theoretical analysis, we also present an extensive empirical analysis comparing comp-sum losses. We further report the results of a series of experiments demonstrating that our adversarial robustness algorithms outperform the current state of the art, while also achieving superior non-adversarial accuracy.
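To make the comp-sum family concrete, here is a minimal, self-contained NumPy sketch of the three members named in the abstract. The function names and the softmax-probability formulation are illustrative choices; the generalized cross-entropy formula $(1 - p_y^q)/q$ follows Zhang & Sabuncu (2018).

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(scores, y):
    """Logistic loss composed with softmax: -log p_y."""
    p = softmax(scores)
    return -np.log(p[np.arange(len(y)), y])

def generalized_cross_entropy(scores, y, q=0.7):
    """GCE (Zhang & Sabuncu, 2018): (1 - p_y^q) / q, q in (0, 1].
    Recovers cross-entropy as q -> 0 and MAE at q = 1."""
    p = softmax(scores)
    p_y = p[np.arange(len(y)), y]
    return (1.0 - p_y ** q) / q

def mean_absolute_error(scores, y):
    """MAE over softmax probabilities: 1 - p_y (up to a factor of 2)."""
    p = softmax(scores)
    return 1.0 - p[np.arange(len(y)), y]

# Example: 3-class scores for a batch of 2 points with labels [0, 2].
scores = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
y = np.array([0, 2])
print(cross_entropy(scores, y))
print(generalized_cross_entropy(scores, y))
print(mean_absolute_error(scores, y))
```

The name "comp-sum" reflects the construction: each member composes an outer transformation with a sum over labels of exponentiated score differences; cross-entropy corresponds to a logarithmic outer function, and other choices yield the remaining family members.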
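The abstract's central objects, $H$-consistency bounds and minimizability gaps, take a standard generic shape in this line of work. The LaTeX below is a schematic rendering using assumed notation ($\Gamma$, $\mathcal{M}_\ell$), not the paper's verbatim statement.

```latex
% Schematic form of an H-consistency bound (notation assumed, not verbatim
% from the paper): for every hypothesis h in H,
\[
  \mathcal{R}_{\ell_{0\text{-}1}}(h)
  - \mathcal{R}^{*}_{\ell_{0\text{-}1}}(\mathcal{H})
  + \mathcal{M}_{\ell_{0\text{-}1}}(\mathcal{H})
  \;\leq\;
  \Gamma\!\left(
    \mathcal{R}_{\ell}(h)
    - \mathcal{R}^{*}_{\ell}(\mathcal{H})
    + \mathcal{M}_{\ell}(\mathcal{H})
  \right),
\]
% where Gamma is a non-decreasing function determined by the surrogate
% loss ell, and the minimizability gap measures how far the best-in-class
% expected loss is from the expected pointwise infimum:
\[
  \mathcal{M}_{\ell}(\mathcal{H})
  = \mathcal{R}^{*}_{\ell}(\mathcal{H})
  - \mathbb{E}_{x}\left[
      \inf_{h \in \mathcal{H}}
      \mathbb{E}_{y \mid x}\bigl[\ell(h, x, y)\bigr]
    \right].
\]
```

When $\mathcal{H}$ is the family of all measurable functions, the minimizability gaps vanish and a bound of this form reduces to a standard excess-risk (Bayes-consistency) guarantee; the gaps are what make the bound hypothesis-set-specific.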
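Finally, the abstract describes algorithms that minimize a regularized smooth adversarial comp-sum loss. The toy sketch below only mirrors the shape of such an objective: a clean comp-sum term plus a smooth divergence term on an adversarially perturbed input, found here with a single FGSM-like step. The linear model, the one-step attack, and the KL regularizer are our assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(W, x, y):
    p = softmax(x @ W)
    return -np.log(p[np.arange(len(y)), y])

def grad_x_cross_entropy(W, x, y):
    # d/dx of -log softmax(x W)_y = (p - onehot(y)) @ W.T for a linear model.
    p = softmax(x @ W)
    p[np.arange(len(y)), y] -= 1.0
    return p @ W.T

def smooth_adversarial_loss(W, x, y, eps=0.1, beta=1.0):
    # One-step (FGSM-like) approximation of the worst-case perturbation.
    x_adv = x + eps * np.sign(grad_x_cross_entropy(W, x, y))
    clean = cross_entropy(W, x, y)                 # comp-sum term
    p, p_adv = softmax(x @ W), softmax(x_adv @ W)
    smooth = np.sum(p * (np.log(p) - np.log(p_adv)), axis=-1)  # KL(p || p_adv)
    return (clean + beta * smooth).mean()

# Toy usage on random data: 8 points, 4 features, 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=(8, 4))
y = rng.integers(0, 3, size=8)
print(smooth_adversarial_loss(W, x, y))
```

In practice the minimization would run over the parameters of a deep network with a stronger multi-step attack; the point of the sketch is only the two-term structure of the regularized objective.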