
Three Mechanisms of Feature Learning in a Linear Network (2401.07085v3)

Published 13 Jan 2024 in cs.LG and cs.AI

Abstract: Understanding the dynamics of neural networks in different width regimes is crucial for improving their training and performance. We present an exact solution for the learning dynamics of a one-hidden-layer linear network, with one-dimensional data, across any finite width, uniquely exhibiting both kernel and feature learning phases. This study marks a technical advancement by enabling the analysis of the training trajectory from any initialization and a detailed phase diagram under varying common hyperparameters such as width, layer-wise learning rates, and scales of output and initialization. We identify three novel prototype mechanisms specific to the feature learning regime: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling, which contrast starkly with the dynamics observed in the kernel regime. Our theoretical findings are substantiated with empirical evidence showing that these mechanisms also manifest in deep nonlinear networks handling real-world tasks, enhancing our understanding of neural network training dynamics and guiding the design of more effective learning strategies.
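
The setup the abstract describes (a one-hidden-layer linear network on one-dimensional data, with width, learning rates, and initialization scale as knobs) is easy to simulate numerically. The sketch below is a minimal illustration under assumed toy hyperparameters, not the paper's exact parameterization or its closed-form solution: it trains f(x) = (v·u)·x by full-batch gradient descent and monitors the cosine alignment between the two layers and their norms, the kinds of quantities the alignment, disalignment, and rescaling mechanisms are stated in terms of. With the small initialization scale assumed here, cos(u, v) should drift from near zero toward one, a feature-learning-like signature.

```python
import numpy as np

# Minimal sketch (assumed toy setup, not the paper's exact parameterization or
# closed-form solution): a one-hidden-layer linear network f(x) = (v . u) * x
# with scalar input/output and hidden width m, trained by full-batch gradient
# descent on squared loss for 1D regression. We monitor cos(u, v) and the layer
# norms, the kinds of quantities the alignment / disalignment / rescaling
# mechanisms are described with.

rng = np.random.default_rng(0)

m = 64        # hidden width (assumed)
sigma = 0.1   # initialization scale (assumed; small scale favors feature learning)
lr = 0.05     # learning rate shared by both layers (assumed)
steps = 500

# Toy 1D data: y = 2x plus a little noise (assumed task)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + 0.01 * rng.normal(size=n)

u = sigma * rng.normal(size=m)  # input-to-hidden weights
v = sigma * rng.normal(size=m)  # hidden-to-output weights

for t in range(steps + 1):
    pred = (v @ u) * x                 # network output on all inputs
    resid = pred - y
    loss = 0.5 * np.mean(resid ** 2)

    if t % 100 == 0:
        cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
        print(f"step {t:4d}  loss={loss:.4f}  cos(u,v)={cos:+.3f}  "
              f"|u|={np.linalg.norm(u):.3f}  |v|={np.linalg.norm(v):.3f}")

    # Gradient of 0.5 * MSE w.r.t. the scalar product v.u, then chain rule
    # through u and v; both layers are updated simultaneously.
    g = np.mean(resid * x)
    u, v = u - lr * g * v, v - lr * g * u
```

Varying sigma, m, or per-layer learning rates in this sketch is one way to probe the different regimes the paper's phase diagram covers, though only the paper's exact solution identifies where the boundaries lie.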
