Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes (2405.17580v2)

Published 27 May 2024 in cs.LG, cs.AI, and stat.ML

Abstract: The training dynamics of linear networks are well studied in two distinct setups: the lazy regime and the balanced/active regime, depending on the initialization and width of the network. We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both the lazy and balanced regimes, but also a mixed regime in between the two. In the mixed regime, part of the network is lazy while the rest is balanced. More precisely, the network is lazy along singular values that are below a certain threshold and balanced along those that are above the same threshold. At initialization, all singular values are lazy, allowing the network to align itself with the task, so that later in time, when some of the singular values cross the threshold and become active, they converge rapidly (convergence in the balanced regime is notoriously difficult in the absence of alignment). The mixed regime is the 'best of both worlds': it converges from any random initialization (in contrast to balanced dynamics, which require special initialization) and has a low-rank bias (absent in the lazy dynamics). This allows us to prove an almost complete phase diagram of training behavior as a function of the variance at initialization and the width, for an MSE training task.
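
To make the lazy/active distinction concrete, the following is a minimal numerical sketch, not the paper's unifying formula or experimental setup: a two-layer linear network W2 @ W1 trained by full-batch gradient descent on a whitened-data MSE objective 0.5 * ||W2 @ W1 - A_star||_F^2 with a low-rank target A_star. All dimensions, scalings, and hyperparameters are illustrative assumptions. With a small initialization scale one expects the singular values of the end-to-end matrix to emerge roughly one at a time (active, low-rank behavior), while a large initialization scale moves all of them at once (lazy-like behavior).

import numpy as np

# Illustrative sketch only: two-layer linear network, loss 0.5*||W2 @ W1 - A_star||_F^2.
rng = np.random.default_rng(0)
d, width, rank = 20, 200, 3                      # in/out dimension, hidden width, target rank
A_star = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d)) / np.sqrt(rank)

def train(sigma, steps=2000, lr=2e-3, log_every=250):
    # Random (generically unbalanced) Gaussian initialization with overall scale sigma.
    W1 = sigma * rng.standard_normal((width, d)) / np.sqrt(d)
    W2 = sigma * rng.standard_normal((d, width)) / np.sqrt(width)
    for t in range(steps + 1):
        if t % log_every == 0:
            s = np.linalg.svd(W2 @ W1, compute_uv=False)
            print(f"  step {t:5d}: top singular values of W2 @ W1 = {np.round(s[:4], 2)}")
        E = W2 @ W1 - A_star                     # residual of the end-to-end matrix
        g1, g2 = W2.T @ E, E @ W1.T              # gradients of the loss w.r.t. W1 and W2
        W1 -= lr * g1
        W2 -= lr * g2

for sigma in (2.0, 0.05):                        # large vs small initialization scale
    print(f"initialization scale sigma = {sigma}")
    train(sigma)

Watching how many singular values are substantially above zero at each checkpoint gives a rough picture of the regime: incremental, sequential growth for the small-scale run versus simultaneous movement for the large-scale run.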
