
Deconstructing the Goldilocks Zone of Neural Network Initialization (2402.03579v2)

Published 5 Feb 2024 in cs.LG and math.OC

Abstract: The second-order properties of the training loss have a massive impact on the optimization dynamics of deep learning models. Fort & Scherlis (2019) discovered that a large excess of positive curvature and local convexity of the loss Hessian is associated with highly trainable initial points located in a region coined the "Goldilocks zone". Only a handful of subsequent studies touched upon this relationship, so it remains largely unexplained. In this paper, we present a rigorous and comprehensive analysis of the Goldilocks zone for homogeneous neural networks. In particular, we derive the fundamental condition resulting in excess of positive curvature of the loss, explaining and refining its conventionally accepted connection to the initialization norm. Further, we relate the excess of positive curvature to model confidence, low initial loss, and a previously unknown type of vanishing cross-entropy loss gradient. To understand the importance of excessive positive curvature for trainability of deep networks, we optimize fully-connected and convolutional architectures outside the Goldilocks zone and analyze the emergent behaviors. We find that strong model performance is not perfectly aligned with the Goldilocks zone, calling for further research into this relationship.
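
As a rough illustration of the quantity at the center of the abstract, the sketch below (not the authors' code) estimates the excess of positive curvature of the cross-entropy loss Hessian for a tiny bias-free two-layer ReLU network at initializations of different norms, reporting Tr(H)/||H||_F and the fraction of positive eigenvalues. The architecture, random data, and set of scales are illustrative assumptions; the original Goldilocks-zone study estimates these statistics in low-dimensional random subspaces, whereas the full Hessian is only tractable here because the model is so small.

```python
# Minimal sketch (illustrative, not the paper's code): positive-curvature
# statistics of the loss Hessian for a small homogeneous (bias-free) ReLU net
# at initializations of different norms. Sizes, data, and scales are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, h, c, n = 10, 16, 3, 64            # input dim, hidden width, classes, samples
X = torch.randn(n, d)
y = torch.randint(0, c, (n,))

def loss_fn(theta):
    # Unpack a flat parameter vector into a bias-free two-layer ReLU network.
    W1 = theta[: h * d].reshape(h, d)
    W2 = theta[h * d:].reshape(c, h)
    logits = F.relu(X @ W1.T) @ W2.T
    return F.cross_entropy(logits, y)

# Roughly fan-in-scaled base initialization, then rescaled to vary its norm.
theta0 = torch.cat([
    (torch.randn(h, d) / d ** 0.5).reshape(-1),
    (torch.randn(c, h) / h ** 0.5).reshape(-1),
])

for scale in (0.1, 1.0, 10.0):
    theta = scale * theta0
    H = torch.autograd.functional.hessian(loss_fn, theta)     # full (p x p) Hessian
    eigvals = torch.linalg.eigvalsh(H)
    pos_frac = (eigvals > 0).float().mean().item()            # fraction of positive-curvature directions
    trace_to_norm = (eigvals.sum() / torch.linalg.norm(eigvals)).item()  # Tr(H) / ||H||_F
    print(f"scale={scale:5.1f}  Tr(H)/||H||_F={trace_to_norm:+.3f}  "
          f"positive fraction={pos_frac:.2f}")
```

For models of realistic size the dense Hessian cannot be formed, so in practice these statistics would be estimated with iterative eigensolvers or in random subspaces (e.g., tooling along the lines of reference 16 below).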

References (41)
  1. Temperature check: theory and practice for training models with softmax-cross-entropy losses. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  2. A convergence theory for deep learning via over-parameterization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242–252. PMLR, 09–15 Jun 2019.
  3. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
  4. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 connectionist models summer school, pp. 29–37, 1988.
  5. High-dimensional SGD aligns with emerging outlier eigenspaces, 2023.
  6. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  7. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48, 1978.
  8. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
  9. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021.
  10. Sharp minima can generalize for deep nets. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1019–1028. PMLR, 06–11 Aug 2017.
  11. Gradient descent provably optimizes over-parameterized neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.
  12. Emergent properties of the local geometry of neural loss landscapes. CoRR, abs/1910.05929, 2019.
  13. The Goldilocks zone: Towards better understanding of neural network loss landscapes. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI Press, 2019.
  14. An investigation into neural net optimization via Hessian eigenvalue density. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2232–2241. PMLR, 09–15 Jun 2019.
  15. Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W. and Titterington, M. (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, pp. 249–256. PMLR, 13–15 May 2010.
  16. pytorch-hessian-eigenthings: efficient pytorch hessian eigendecomposition, October 2018.
  17. Gradient descent happens in a tiny subspace. CoRR, abs/1812.04754, 2018.
  18. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
  19. Flat minima. Neural Computation, 9(1):1–42, 1997.
  20. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. JMLR.org, 2015.
  21. Neural tangent kernel: Convergence and generalization in neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018.
  22. The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations, ICLR, 2020.
  23. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR, 2017.
  24. Grokking as the transition from lazy to rich training dynamics. arXiv preprint arXiv:2310.06110, 2023.
  25. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
  26. The large learning rate phase of deep learning: the catapult mechanism, 2020.
  27. Bad global minima exist and SGD can reach them. Advances in Neural Information Processing Systems, 33, 2020.
  28. Omnigrok: Grokking beyond algorithmic data. In The Eleventh International Conference on Learning Representations, 2023.
  29. Implicit bias in deep linear classification: Initialization scale vs training accuracy. In NeurIPS 2020. ACM, December 2020.
  30. Vanishing curvature in randomly initialized deep ReLU networks. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 7942–7975. PMLR, 28–30 Mar 2022.
  31. Papyan, V. The full spectrum of deepnet Hessians at scale: Dynamics with SGD training and sample size. arXiv: Learning, 2018.
  32. Papyan, V. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet Hessians. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5012–5021. PMLR, 09–15 Jun 2019.
  33. Papyan, V. Traces of class/cross-class structure pervade deep learning spectra. Journal of Machine Learning Research, 21(252):1–64, 2020.
  34. On the difficulty of training recurrent neural networks. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  35. Automatic differentiation in PyTorch, 2017.
  36. Geometry of neural network loss surfaces via random matrix theory. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2798–2806. PMLR, 06–11 Aug 2017.
  37. Optimality and sub-optimality of PCA I: Spiked random matrix models. The Annals of Statistics, 46(5), October 2018.
  38. Unveiling the Hessian’s connection to the decision boundary, 2023.
  39. Singularity of the Hessian in deep learning. CoRR, abs/1611.07476, 2016.
  40. Analyzing sharpness along GD trajectory: Progressive sharpening and edge of stability. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 9983–9994. Curran Associates, Inc., 2022.
  41. Mitigating neural network overconfidence with logit normalization. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 23631–23644. PMLR, 17–23 Jul 2022.