Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations (2403.08121v3)

Published 12 Mar 2024 in cs.LG, math.OC, and stat.ML

Abstract: This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two. It is shown that, for sufficiently small initializations, the weights of the neural network remain small in (Euclidean) norm during the early stages of training and approximately converge in direction to the Karush-Kuhn-Tucker (KKT) points of the recently introduced neural correlation function. The paper also studies the KKT points of the neural correlation function for feed-forward networks with (Leaky) ReLU and polynomial (Leaky) ReLU activations, deriving necessary and sufficient conditions for rank-one KKT points.
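
The central claim concerns the early phase of gradient flow: with a sufficiently small initialization, the weight norm stays small while the weight vector approximately settles on a fixed direction. The snippet below is a minimal numerical sketch of that phenomenon, not the paper's construction or proofs. It discretizes gradient flow with small-step gradient descent on a two-layer squared-ReLU network, which is homogeneous of order three (greater than two) in its parameters and has a locally Lipschitz gradient, and it tracks both the weight norm and the cosine similarity of the weight direction during the early phase. All dimensions, step sizes, and stopping thresholds are illustrative assumptions.

```python
# Illustrative sketch: early directional convergence under small initialization.
# Model, sizes, and hyperparameters are assumptions chosen for demonstration only.
import torch

torch.manual_seed(0)
d, m, n = 5, 20, 50                     # input dim, hidden width, sample count
X = torch.randn(n, d)
y = torch.randn(n)

delta = 1e-2                            # small initialization scale
W = (delta * torch.randn(m, d)).requires_grad_(True)  # hidden-layer weights
a = (delta * torch.randn(m)).requires_grad_(True)     # output weights

def model(X):
    # f(x) = sum_j a_j * relu(w_j . x)^2 : homogeneous of order 3 in (W, a),
    # with locally Lipschitz gradient (a polynomial-ReLU-style network).
    return torch.relu(X @ W.T).pow(2) @ a

def flat():
    # Current parameter vector, detached from the graph.
    return torch.cat([W.detach().reshape(-1), a.detach().reshape(-1)])

eta = 1e-2                              # forward-Euler step approximating gradient flow
w0_norm = flat().norm().item()
snapshots, t = [], 0
# "Early phase" proxy: stop once the weight norm has grown noticeably.
while flat().norm().item() < 4 * w0_norm and t < 200_000:
    loss = 0.5 * (model(X) - y).pow(2).mean()
    gW, ga = torch.autograd.grad(loss, (W, a))
    with torch.no_grad():
        W -= eta * gW
        a -= eta * ga
    if t % 10_000 == 0:
        w = flat()
        snapshots.append(w / w.norm())
        print(f"step {t:7d}   ||w|| = {w.norm().item():.3e}")
    t += 1

# Directional convergence check: the cosine similarity of each recorded direction
# with the last one should climb toward 1 while ||w|| remains of order delta.
for k, u in enumerate(snapshots):
    cos = float(u @ snapshots[-1])
    print(f"snapshot {k}: cos(angle to final early-phase direction) = {cos:.4f}")
```

In runs of this kind one expects the printed norms to stay close to the initialization scale while the cosine similarities approach 1; in the paper's analysis, the direction reached during this phase corresponds approximately to a KKT point of the neural correlation function.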
