Label Distributionally Robust Losses for Multi-class Classification: Consistency, Robustness and Adaptivity (2112.14869v4)

Published 30 Dec 2021 in cs.LG

Abstract: We study a family of loss functions named label-distributionally robust (LDR) losses for multi-class classification that are formulated from a distributionally robust optimization (DRO) perspective, where the uncertainty in the given label information is modeled and captured by taking the worst case over distributional weights. The benefits of this perspective are several-fold: (i) it provides a unified framework to explain the classical cross-entropy (CE) loss and SVM loss and their variants; (ii) it includes a special family corresponding to the temperature-scaled CE loss, which is widely adopted but poorly understood; (iii) it allows us to achieve adaptivity to the degree of uncertainty in the label information at the instance level. Our contributions include: (1) we study both consistency and robustness by establishing top-$k$ ($\forall k\geq 1$) consistency of LDR losses for multi-class classification, and a negative result that a top-$1$ consistent and symmetric robust loss cannot achieve top-$k$ consistency simultaneously for all $k\geq 2$; (2) we propose a new adaptive LDR loss that automatically adapts an individualized temperature parameter to the noise degree of the class label of each instance; (3) we demonstrate stable and competitive performance of the proposed adaptive LDR loss on 7 benchmark datasets under 6 noisy-label settings and 1 clean setting against 13 loss functions, and on one real-world noisy dataset. The code is open-sourced at \url{https://github.com/Optimization-AI/ICML2023_LDR}.
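
The connection claimed in point (ii), that a worst case over label-distribution weights with a KL penalty toward the uniform distribution reduces to the temperature-scaled CE loss, can be checked numerically via the standard log-sum-exp variational identity. The sketch below is a minimal illustration of that identity under this assumed KL-regularized form; it is not the paper's exact parameterization or the open-sourced implementation, and the function names (ldr_loss, ldr_loss_explicit, temperature_ce) are chosen here for illustration only.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def ldr_loss(scores, y, tau):
    """KL-regularized label-DRO loss in closed form (log-sum-exp).

    Worst case over label weights p in the simplex of
        sum_k p_k * (s_k - s_y) - tau * KL(p || uniform).
    """
    v = scores - scores[y]          # margins relative to the given label
    K = len(scores)
    return tau * (logsumexp(v / tau) - np.log(K))

def ldr_loss_explicit(scores, y, tau):
    """Evaluate the DRO objective at its maximizer p* = softmax((s - s_y)/tau)."""
    v = scores - scores[y]
    K = len(scores)
    p = softmax(v / tau)
    kl = np.sum(p * np.log(np.clip(p * K, 1e-30, None)))   # KL(p || uniform)
    return np.dot(p, v) - tau * kl

def temperature_ce(scores, y, tau):
    """Temperature-scaled cross-entropy: tau * CE(softmax(s/tau), y)."""
    return tau * (logsumexp(scores / tau) - scores[y] / tau)

rng = np.random.default_rng(0)
s, y, tau, K = rng.normal(size=10), 3, 0.5, 10
# The closed form agrees with evaluating the objective at the worst-case weights.
assert np.isclose(ldr_loss(s, y, tau), ldr_loss_explicit(s, y, tau))
# The DRO loss equals the temperature-scaled CE up to an additive constant tau*log(K).
assert np.isclose(ldr_loss(s, y, tau), temperature_ce(s, y, tau) - tau * np.log(K))
```

In this form the temperature controls how concentrated the worst-case weights are: as tau goes to 0 the loss approaches the largest margin max_k (s_k - s_y), an SVM-style loss, while a large tau spreads the weights toward uniform. This is one way to read the abstract's claim that the LDR family unifies CE- and SVM-type losses and that the temperature reflects the uncertainty in the given label.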
