
Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron (2302.10034v2)

Published 20 Feb 2023 in cs.LG, math.OC, and stat.ML

Abstract: We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove gradient descent automatically balances student neurons, and use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
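To make the setting concrete, the sketch below simulates randomly initialized gradient descent on the square loss for a single ReLU teacher neuron under Gaussian inputs, once with an exact-parameterized student (n = 1) and once with an over-parameterized one (n = 4). This is not the paper's code: the dimension, step size, sample count, and initialization scale are illustrative assumptions, and the finite-sample loss only approximates the population objective the paper analyzes. It also prints the pairwise distances between the over-parameterized student's neurons, the quantity on which the paper's lower-bound potential function is built.

```python
# Minimal simulation sketch (not the paper's code): GD on the empirical square
# loss for learning one ReLU neuron, with exact (n=1) vs over- (n=4)
# parameterized students. All hyperparameters here are illustrative choices.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
d, T, eta, m = 10, 5000, 0.05, 4096   # dimension, GD steps, step size, samples

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)       # unit-norm teacher neuron w*
X = rng.standard_normal((m, d))        # Gaussian inputs x ~ N(0, I_d)
y = np.maximum(X @ w_star, 0.0)        # teacher labels ReLU(<w*, x>)

def run_gd(n):
    """GD on L(W) = (1/2) * mean_x (sum_i ReLU(<w_i, x>) - ReLU(<w*, x>))^2."""
    W = 0.1 * rng.standard_normal((n, d))     # small random initialization
    losses = []
    for _ in range(T):
        pre = X @ W.T                          # (m, n) pre-activations
        resid = np.maximum(pre, 0.0).sum(axis=1) - y
        losses.append(0.5 * np.mean(resid ** 2))
        # dL/dw_i = mean_x [ resid * 1{<w_i, x> > 0} * x ]
        grad = ((resid[:, None] * (pre > 0)).T @ X) / m
        W -= eta * grad
    return np.array(losses), W

loss_exact, _ = run_gd(n=1)
loss_over, W_over = run_gd(n=4)
for t in (100, 1000, T - 1):
    print(f"t={t:5d}  exact-param loss: {loss_exact[t]:.3e}  "
          f"over-param loss: {loss_over[t]:.3e}")

# Pairwise distances between the over-parameterized student's neurons at the
# end of training: the paper's lower-bound potential function tracks how
# slowly these shrink.
dists = [np.linalg.norm(W_over[i] - W_over[j])
         for i, j in combinations(range(W_over.shape[0]), 2)]
print("final pairwise neuron distances (n=4):", np.round(dists, 3))
```

With these settings the exact-parameterized run typically drops off geometrically once its neuron aligns with w*, while the over-parameterized loss keeps decaying at a visibly polynomial pace, qualitatively matching the exp(-Omega(T)) versus Theta(T^{-3}) separation stated in the abstract; the exact numbers depend on the random seed and step size.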

Authors (2)
  1. Weihang Xu
  2. Simon S. Du