Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy (2403.07379v2)
Abstract: We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of their optimization trajectories, represented by the pointwise parameters along training. To this end, we introduce natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which capture the directional nature of optimization in neural networks: when there is redundancy, and when there is exploration. We use them to reveal the inherent nuance and interplay between various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the effect of scale in regularizing the directional nature of trajectories; as a by-product, we also observe an intriguing heterogeneity of Q, K, V dynamics in the middle attention layers of LLMs, which is homogenized by scale. Importantly, we put the observed directional redundancy to the test by demonstrating that training only the scalar batchnorm parameters from some point onwards in training matches the performance of training the entire network, which exhibits the potential of hybrid optimization schemes geared towards efficiency.
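To make the abstract's notion of directional redundancy versus exploration concrete, here is a minimal sketch (in PyTorch, and not necessarily the paper's exact measure) of how one might probe the directional structure of a trajectory: flatten the pointwise parameters saved at a series of checkpoints and compute their pairwise cosine similarities. The `checkpoints` argument and helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def flatten_params(state_dict):
    """Concatenate all tensors of a checkpoint into one 1-D vector."""
    return torch.cat([p.detach().float().reshape(-1) for p in state_dict.values()])

def trajectory_cosine_map(checkpoints):
    """Pairwise cosine similarities between pointwise parameters along training.

    `checkpoints` is a list of model state_dicts collected during training.
    High off-diagonal values suggest directional redundancy; low values
    suggest directional exploration.
    """
    vecs = torch.stack([flatten_params(sd) for sd in checkpoints])  # (T, d)
    vecs = F.normalize(vecs, dim=1)                                 # unit-norm rows
    return vecs @ vecs.T                                            # (T, T) similarity map
```

Along the same lines, a hypothetical sketch of the batchnorm-only probe mentioned in the abstract: from some point in training onwards, freeze everything except the scalar batchnorm affine parameters and continue optimizing only those.

```python
import torch.nn as nn

def freeze_all_but_batchnorm(model: nn.Module):
    """Leave only BatchNorm affine parameters (weight/bias) trainable."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    for module in model.modules():
        for p in module.parameters(recurse=False):
            p.requires_grad = isinstance(module, bn_types)
```

In this sketch, the optimizer would then be rebuilt over `filter(lambda p: p.requires_grad, model.parameters())` for the remainder of training.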