Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise (2208.08003v5)

Published 17 Aug 2022 in cs.LG and stat.ML

Abstract: Increasing the size of overparameterized neural networks has been key to achieving state-of-the-art performance. This is captured by the double descent phenomenon, where the test loss follows a decreasing-increasing-decreasing pattern (or sometimes monotonically decreasing) as model width increases. However, the effect of label noise on the test loss curve has not been fully explored. In this work, we uncover an intriguing phenomenon where label noise leads to a \textit{final ascent} in the originally observed double descent curve. Specifically, under a sufficiently large noise-to-sample-size ratio, optimal generalization is achieved at intermediate widths. Through theoretical analysis, we attribute this phenomenon to the shape transition of test loss variance induced by label noise. Furthermore, we extend the final ascent phenomenon to model density and provide the first theoretical characterization showing that reducing density by randomly dropping trainable parameters improves generalization under label noise. We also thoroughly examine the roles of regularization and sample size. Surprisingly, we find that larger $\ell_2$ regularization and robust learning methods against label noise exacerbate the final ascent. We confirm the validity of our findings through extensive experiments on ReLU networks trained on MNIST, ResNets/ViTs trained on CIFAR-10/100, and InceptionResNet-v2 trained on Stanford Cars with real-world noisy labels.
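To make the abstract's setup concrete, below is a minimal sketch (not the authors' code) of the kind of experiment it describes: sweep the width of a two-layer ReLU network trained on MNIST with symmetric label noise and record test loss, looking for a final ascent at large widths. It assumes PyTorch/torchvision; the architecture, 30% noise rate, 4,000-sample training subset, width grid, optimizer settings, and the `density` mask (randomly zeroing and freezing a fraction of first-layer weights) are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms


def add_symmetric_label_noise(labels, noise_rate, num_classes=10, seed=0):
    """Flip a fraction `noise_rate` of labels to a uniformly random other class."""
    g = torch.Generator().manual_seed(seed)
    labels = labels.clone()
    n_noisy = int(noise_rate * len(labels))
    idx = torch.randperm(len(labels), generator=g)[:n_noisy]
    shift = torch.randint(1, num_classes, (n_noisy,), generator=g)
    labels[idx] = (labels[idx] + shift) % num_classes
    return labels


class TwoLayerReLU(nn.Module):
    """Two-layer ReLU net; `density` < 1 randomly zeroes and freezes first-layer weights."""

    def __init__(self, width, density=1.0, seed=0):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, width)
        self.fc2 = nn.Linear(width, 10)
        g = torch.Generator().manual_seed(seed)
        self.mask = (torch.rand(self.fc1.weight.shape, generator=g) < density).float()
        with torch.no_grad():
            self.fc1.weight *= self.mask  # dropped parameters start at zero

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x.view(x.size(0), -1))))


def train_and_eval(width, density, train_loader, test_loader, epochs=20, lr=0.01):
    model = TwoLayerReLU(width, density)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            model.fc1.weight.grad *= model.mask  # keep dropped parameters frozen
            opt.step()
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in test_loader:
            total += F.cross_entropy(model(x), y, reduction="sum").item()
            n += y.numel()
    return total / n


if __name__ == "__main__":
    tfm = transforms.ToTensor()
    train = datasets.MNIST("data", train=True, download=True, transform=tfm)
    test = datasets.MNIST("data", train=False, download=True, transform=tfm)
    # 30% symmetric noise on a 4,000-sample subset -> large noise-to-sample-size ratio
    train.targets = add_symmetric_label_noise(train.targets, noise_rate=0.3)
    subset = torch.utils.data.Subset(train, range(4000))
    train_loader = torch.utils.data.DataLoader(subset, batch_size=128, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test, batch_size=512)
    for width in [16, 64, 256, 1024, 4096]:
        loss = train_and_eval(width, density=1.0, train_loader=train_loader,
                              test_loader=test_loader)
        print(f"width={width:5d}  test loss={loss:.4f}")
```

Per the paper's claim, with a sufficiently large noise-to-sample-size ratio the printed test-loss curve would bottom out at an intermediate width and rise again at the largest widths; setting `density < 1.0` sparsifies the network at fixed width, which is the abstract's second knob for probing the same effect.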

Authors (3)
  1. Yihao Xue (10 papers)
  2. Kyle Whitecross (2 papers)
  3. Baharan Mirzasoleiman (51 papers)
Citations (1)