The twin peaks of learning neural networks (2401.12610v2)

Published 23 Jan 2024 in cs.LG, cond-mat.dis-nn, math.PR, math.ST, and stat.TH

Abstract: Recent works have demonstrated the existence of a double-descent phenomenon in the generalization error of neural networks, whereby highly overparameterized models escape overfitting and achieve good test performance, at odds with the standard bias-variance trade-off of statistical learning theory. In the present work, we explore a link between this phenomenon and the increasing complexity and sensitivity of the function represented by the network. In particular, we study the Boolean mean dimension (BMD), a metric developed in the context of Boolean function analysis. Focusing on a simple teacher-student setting for the random feature model, we develop a theoretical analysis based on the replica method that yields an interpretable expression for the BMD in the high-dimensional regime where the number of data points, the number of features, and the input size grow to infinity. We find that, as the degree of overparameterization of the network increases, the BMD exhibits a pronounced peak at the interpolation threshold, coinciding with the peak in generalization error, before slowly decaying to a low asymptotic value. The same phenomenology is then traced in numerical experiments across different model classes and training setups. Moreover, we find empirically that adversarially initialized models tend to show higher BMD values, and that models more robust to adversarial attacks exhibit a lower BMD.
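For a real-valued function f on the Boolean hypercube {-1,+1}^n, the BMD can be written as the total influence divided by the variance: BMD(f) = Σ_i Inf_i(f) / Var(f), where the influence of coordinate i is Inf_i(f) = E_x[((f(x) - f(x^⊕i))/2)^2] and x^⊕i denotes x with its i-th coordinate flipped. This identity suggests a simple Monte Carlo estimator. The sketch below applies it to a random feature model with untrained Gaussian weights; it is a minimal illustration of the metric under these assumptions, not the authors' implementation, and the sizes, activation, and sample counts are placeholder choices.

```python
# Monte Carlo estimate of the Boolean mean dimension (BMD):
#   BMD(f) = sum_i Inf_i(f) / Var(f),
#   Inf_i(f) = E_x[((f(x) - f(x with coordinate i flipped)) / 2)^2].
import numpy as np

rng = np.random.default_rng(0)

def boolean_mean_dimension(f, n, n_samples=4000):
    """Estimate the BMD of f: {-1,+1}^n -> R via coordinate flips."""
    x = rng.choice([-1.0, 1.0], size=(n_samples, n))  # uniform hypercube inputs
    fx = f(x)
    total_influence = 0.0
    for i in range(n):
        x_flip = x.copy()
        x_flip[:, i] *= -1.0                  # flip the i-th coordinate
        delta = (fx - f(x_flip)) / 2.0
        total_influence += np.mean(delta**2)  # Monte Carlo estimate of Inf_i(f)
    return total_influence / fx.var()

# Placeholder random feature model f(x) = a^T tanh(W x / sqrt(n))
# with untrained Gaussian weights (sizes chosen arbitrarily).
n, p = 20, 200
W = rng.standard_normal((p, n))
a = rng.standard_normal(p) / np.sqrt(p)

def random_feature_model(x):
    return np.tanh(x @ W.T / np.sqrt(n)) @ a

print("estimated BMD:", boolean_mean_dimension(random_feature_model, n))
```

In the paper's setting, this quantity is tracked for trained models as the degree of overparameterization is varied; the abstract's claim is that the resulting BMD curve peaks at the interpolation threshold together with the generalization error.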
