The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents (2402.03220v3)
Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD), which reuses batches multiple times, and show that it significantly changes the conclusions about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that, upon reusing batches, the network achieves an overlap with the target subspace in just two time steps, even for functions that do not satisfy the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of dynamical mean-field theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, as well as numerical experiments illustrating the theory.
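To make the abstract's central claim concrete (that two GD steps on a reused batch can already produce an overlap with the target subspace where single-pass GD is stalled by a high information exponent), here is a minimal numpy sketch of the comparison. It is an illustration under simplifying assumptions (a single-index He_4 target, tanh activation, fixed second layer, squared loss, hand-picked hyperparameters), not the paper's exact protocol; every name and constant below is my own choice.

```python
# Minimal sketch: single-pass GD (fresh batch each step) vs multi-pass GD
# (the same batch reused) on a two-layer network with a hard single-index target.
# All hyperparameters and design choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, lr, steps = 256, 64, 4096, 2.0, 2   # input dim, width, batch size, stepsize, GD steps

w_star = np.zeros(d)
w_star[0] = 1.0                               # hidden direction of the target

def target(X):
    z = X @ w_star
    return z**4 - 6 * z**2 + 3                # Hermite polynomial He_4: information exponent 4

def overlap(W):
    # strongest alignment of a first-layer row with the target direction
    return np.max(np.abs(W @ w_star) / np.linalg.norm(W, axis=1))

def run(reuse_batch):
    W = rng.standard_normal((p, d)) / np.sqrt(d)   # first-layer weights
    a = rng.standard_normal(p) / np.sqrt(p)        # second layer, kept fixed here
    X = y = None
    for _ in range(steps):
        if X is None or not reuse_batch:           # fresh batch (single-pass) vs reuse (multi-pass)
            X = rng.standard_normal((n, d))
            y = target(X)
        pre = X @ W.T                              # (n, p) preactivations
        err = np.tanh(pre) @ a - y
        grad_W = ((err[:, None] * (1 - np.tanh(pre)**2) * a).T @ X) / n  # d(squared loss)/dW
        W -= lr * grad_W
    return overlap(W)

print("single-pass overlap after 2 steps:", run(reuse_batch=False))
print("multi-pass  overlap after 2 steps:", run(reuse_batch=True))
```

The quantity to watch is the overlap after the second step: the abstract's claim is that batch reuse is what lets it become macroscopic for targets whose information exponent stalls single-pass GD, though the size of the effect in this toy sketch depends on the illustrative hyperparameters above.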
- The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021.
- The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
- SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics, 2023.
- Out-of-equilibrium dynamical mean-field equations for the perceptron model. Journal of Physics A: Mathematical and Theoretical, 51(8):085002, 2018.
- G. E. Andrews. Special functions. Cambridge University Press, 2004.
- The committee machine: computational to statistical gaps in learning a two-layers neural network. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124023, Dec. 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ab43d2. URL http://dx.doi.org/10.1088/1742-5468/ab43d2.
- High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37932–37946. Curran Associates, Inc., 2022.
- Learning in the presence of low-dimensional structure: a spiked random matrix perspective. In Advances in Neural Information Processing Systems, 2023.
- Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
- M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
- Symmetric Langevin spin glass dynamics. The Annals of Probability, 25(3):1367–1422, 1997.
- Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1–51, 2021.
- On learning gaussian multi-index models with gradient flow. arXiv preprint arXiv:2310.19793, 2023.
- E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
- Spectrum dependent learning curves in kernel regression and wide neural networks. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1024–1034. PMLR, 13–18 Jul 2020.
- Out of equilibrium dynamics in spin-glasses and other glassy systems. Spin glasses and random fields, 12:161, 1998.
- The high-dimensional asymptotics of first order methods with random data. arXiv:2112.07572, 2021.
- S. Chen and R. Meka. Learning polynomials in few relevant dimensions. In Conference on Learning Theory, pages 1161–1227. PMLR, 2020.
- L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31, 2018.
- L. F. Cugliandolo. Dynamics of glassy systems. In Slow Relaxations and Nonequilibrium Dynamics in Condensed Matter. Springer, 2003.
- Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10131–10143. Curran Associates, Inc., 2021.
- Neural networks can learn representations with gradient descent. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 5413–5452. PMLR, 02–05 Jul 2022.
- Smoothing the landscape boosts the signal for SGD: optimal sample complexity for learning single index models. arXiv preprint arXiv:2305.10633, 2023.
- How two-layer neural networks learn, one (giant) step at a time, 2023.
- Statistical mechanics of support vector networks. Phys. Rev. Lett., 82:2975–2978, Apr 1999. doi: 10.1103/PhysRevLett.82.2975.
- H. Eissfeller and M. Opper. New method for studying the dynamics of disordered spin systems without finite-size effects. Physical Review Letters, 68(13):2094, 1992.
- H. Eissfeller and M. Opper. Mean-field Monte Carlo approach to the Sherrington-Kirkpatrick model with asymmetric couplings. Physical Review E, 50(2):709, 1994.
- Dynamical mean-field theory of strongly correlated fermion systems and the limit of infinite dimensions. Reviews of Modern Physics, 68(1):13, 1996.
- Rigorous dynamical mean field theory for stochastic gradient descent methods, 2023.
- Limitations of lazy training of two-layers neural network. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- When do neural networks outperform kernel methods? In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14820–14830. Curran Associates, Inc., 2020.
- Learning curves of generic features maps for realistic datasets with a teacher-student model. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18137–18151. Curran Associates, Inc., 2021.
- Phase retrieval in high dimensions: Statistical and computational phase transitions, 2020.
- S. S. Mannelli and P. Urbani. Just a momentum: Analytical study of momentum-based acceleration methods in paradigmatic high-dimensional non-convex problems. NeurIPS, 2021.
- Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. In Advances in Neural Information Processing Systems, pages 8676–8686, 2019a.
- Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342, 2019b.
- Marvels and pitfalls of the Langevin algorithm in noisy high-dimensional inference. Physical Review X, 10(1):011057, 2020.
- A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- F. Mignacco and P. Urbani. The effective noise of stochastic gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, Aug. 2022. doi: 10.1088/1742-5468/ac841d. URL https://doi.org/10.1088/1742-5468/ac841d.
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification. Advances in Neural Information Processing Systems, 33:9540–9550, 2020.
- Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem. Machine Learning: Science and Technology, 2(3):035029, 2021.
- A theory of non-linear feature learning with one gradient step in two-layer neural networks, 2023.
- A. Montanari and B. N. Saeed. Universality of empirical risk minimization. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4310–4312. PMLR, 02–05 Jul 2022.
- Gradient-based feature learning under structured data, 2023.
- G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935, 2022. doi: https://doi.org/10.1002/cpa.22074.
- Numerical implementation of dynamical mean field theory for disordered systems: application to the Lotka–Volterra model of ecosystems. Journal of Physics A: Mathematical and Theoretical, 52(48):484001, Nov. 2019. ISSN 1751-8121. doi: 10.1088/1751-8121/ab1f32. URL http://dx.doi.org/10.1088/1751-8121/ab1f32.
- D. Saad and S. A. Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225–4243, Oct. 1995. doi: 10.1103/PhysRevE.52.4225.
- J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020.
- H. Sompolinsky and A. Zippelius. Dynamic theory of the spin-glass phase. Phys. Rev. Lett., 47:359–362, Aug 1981.
- Chaos in random neural networks. Phys. Rev. Lett., 61:259–262, Jul 1988.
- A. Zweig and J. Bruna. Symmetric single index learning, 2023.