The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents (2402.03220v3)

Published 5 Feb 2024 in stat.ML and cs.LG

Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.

References (51)
  1. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021.
  2. The merged-staircase property: A necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
  3. SGD learning on neural networks: Leap complexity and saddle-to-saddle dynamics, 2023.
  4. Out-of-equilibrium dynamical mean-field equations for the perceptron model. Journal of Physics A: Mathematical and Theoretical, 51(8):085002, 2018.
  5. G. E. Andrews. Special functions. Cambridge University Press, 2004.
  6. The committee machine: computational to statistical gaps in learning a two-layers neural network. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124023, Dec. 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ab43d2. URL http://dx.doi.org/10.1088/1742-5468/ab43d2.
  7. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37932–37946. Curran Associates, Inc., 2022.
  8. Learning in the presence of low-dimensional structure: a spiked random matrix perspective. In NeurIPS 2023.
  9. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
  10. M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
  11. Symmetric Langevin spin glass dynamics. The Annals of Probability, 25(3):1367–1422, 1997.
  12. Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1–51, 2021.
  13. On learning Gaussian multi-index models with gradient flow. arXiv preprint arXiv:2310.19793, 2023.
  14. E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
  15. Spectrum dependent learning curves in kernel regression and wide neural networks. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1024–1034. PMLR, 13–18 Jul 2020.
  16. Out of equilibrium dynamics in spin-glasses and other glassy systems. Spin glasses and random fields, 12:161, 1998.
  17. The high-dimensional asymptotics of first order methods with random data. arXiv:2112.07572, 2021.
  18. S. Chen and R. Meka. Learning polynomials in few relevant dimensions. In Conference on Learning Theory, pages 1161–1227. PMLR, 2020.
  19. L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
  20. L. F. Cugliandolo. Dynamics of glassy systems. In Slow Relaxations and nonequilibrium dynamics in condensed matter. Springer, 2003.
  21. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10131–10143. Curran Associates, Inc., 2021.
  22. Neural networks can learn representations with gradient descent. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 5413–5452. PMLR, 02–05 Jul 2022.
  23. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. arXiv preprint arXiv:2305.10633, 2023.
  24. How two-layer neural networks learn, one (giant) step at a time, 2023.
  25. Statistical mechanics of support vector networks. Phys. Rev. Lett., 82:2975–2978, Apr 1999. doi: 10.1103/PhysRevLett.82.2975.
  26. H. Eissfeller and M. Opper. New method for studying the dynamics of disordered spin systems without finite-size effects. Physical Review Letters, 68(13):2094, 1992.
  27. H. Eissfeller and M. Opper. Mean-field Monte Carlo approach to the Sherrington-Kirkpatrick model with asymmetric couplings. Physical Review E, 50(2):709, 1994.
  28. Dynamical mean-field theory of strongly correlated fermion systems and the limit of infinite dimensions. Reviews of Modern Physics, 68(1):13, 1996.
  29. Rigorous dynamical mean field theory for stochastic gradient descent methods, 2023.
  30. Limitations of lazy training of two-layers neural network. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  31. When do neural networks outperform kernel methods? In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14820–14830. Curran Associates, Inc., 2020.
  32. Learning curves of generic features maps for realistic datasets with a teacher-student model. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18137–18151. Curran Associates, Inc., 2021.
  33. Phase retrieval in high dimensions: Statistical and computational phase transitions, 2020.
  34. S. S. Mannelli and P. Urbani. Just a momentum: Analytical study of momentum-based acceleration methods in paradigmatic high-dimensional non-convex problems. NeurIPS, 2021.
  35. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. In Advances in Neural Information Processing Systems, pages 8676–8686, 2019a.
  36. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342, 2019b.
  37. Marvels and pitfalls of the Langevin algorithm in noisy high-dimensional inference. Physical Review X, 10(1):011057, 2020.
  38. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  39. F. Mignacco and P. Urbani. The effective noise of stochastic gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, aug 2022. doi: 10.1088/1742-5468/ac841d. URL https://doi.org/10.1088/1742-5468/ac841d.
  40. Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification. Advances in Neural Information Processing Systems, 33:9540–9550, 2020.
  41. Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem. Machine Learning: Science and Technology, 2(3):035029, 2021.
  42. A theory of non-linear feature learning with one gradient step in two-layer neural networks, 2023.
  43. A. Montanari and B. N. Saeed. Universality of empirical risk minimization. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4310–4312. PMLR, 02–05 Jul 2022.
  44. Gradient-based feature learning under structured data, 2023.
  45. G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935, 2022. doi: https://doi.org/10.1002/cpa.22074.
  46. Numerical implementation of dynamical mean field theory for disordered systems: application to the Lotka–Volterra model of ecosystems. Journal of Physics A: Mathematical and Theoretical, 52(48):484001, Nov. 2019. ISSN 1751-8121. doi: 10.1088/1751-8121/ab1f32. URL http://dx.doi.org/10.1088/1751-8121/ab1f32.
  47. D. Saad and S. A. Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225–4243, Oct. 1995. doi: 10.1103/PhysRevE.52.4225.
  48. J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020.
  49. H. Sompolinsky and A. Zippelius. Dynamic theory of the spin-glass phase. Phys. Rev. Lett., 47:359–362, Aug 1981.
  50. Chaos in random neural networks. Phys. Rev. Lett., 61:259–262, Jul 1988.
  51. A. Zweig and J. Bruna. Symmetric single index learning, 2023.

Summary

  • The paper establishes that multi-pass gradient descent, which reuses the same batch across steps, overcomes the information- and leap-exponent limitations of single-pass methods.
  • It employs Dynamical Mean-Field Theory (DMFT) to characterize the training dynamics in closed form and to reveal hidden progress in the network weights.
  • Theory and experiments indicate that as few as two gradient steps on the same batch yield a nonzero overlap with the target subspace, even for functions that do not satisfy the staircase property.

Introduction

Understanding the training dynamics of neural networks provides valuable insight into their learning capabilities and limitations. This paper investigates the effect of reusing data batches when training two-layer neural networks on multi-index target functions. It challenges the standard theoretical setting in which a fresh batch is drawn at every iteration, demonstrating the advantage of multi-pass gradient descent (multi-pass GD) over single-pass gradient descent.
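
As a rough illustration of the two protocols being compared (not the authors' exact experiment), the sketch below trains a small two-layer network on a synthetic single-index target built from the Hermite polynomial He_3, which has information exponent 3. The single-pass run draws a fresh batch at every step, while the multi-pass run keeps reusing one batch of size proportional to the input dimension; both report the overlap of the first-layer weights with the hidden direction. Dimensions, batch size, step size, and activation are illustrative choices, not the paper's scalings.

```python
# Minimal sketch (not the paper's exact experiment): single-pass vs multi-pass GD
# on a two-layer network learning a single-index target with information exponent 3.
# Dimensions, batch size, step size, and activation are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, steps, lr = 256, 32, 1024, 100, 10.0   # input dim, hidden units, batch size (~4d), GD steps, step size

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)                          # hidden target direction
he3 = lambda z: z ** 3 - 3.0 * z                        # Hermite polynomial He_3 -> information exponent 3
target = lambda X: he3(X @ theta)

def forward(W, a, X):
    H = np.tanh(X @ W.T / np.sqrt(d))                   # (n, p) hidden activations
    return H @ a / p, H

def grad_W(W, a, X, y):
    pred, H = forward(W, a, X)
    err = pred - y                                      # residual of the squared loss
    dpre = (1.0 - H ** 2) * err[:, None] * (a[None, :] / p)   # backprop through tanh
    return dpre.T @ X / (len(y) * np.sqrt(d))

def overlap(W):
    # largest cosine similarity between a first-layer row and the target direction
    return float(np.max(np.abs(W @ theta) / np.linalg.norm(W, axis=1)))

def train(reuse_batch):
    W = rng.standard_normal((p, d))                     # first layer (trained)
    a = rng.choice([-1.0, 1.0], size=p)                 # second layer (kept fixed)
    X = rng.standard_normal((n, d)); y = target(X)
    for _ in range(steps):
        if not reuse_batch:                             # single-pass: draw a fresh batch every step
            X = rng.standard_normal((n, d)); y = target(X)
        W -= lr * grad_W(W, a, X, y)                    # multi-pass: the same batch is reused
    return overlap(W)

print("single-pass overlap:", train(reuse_batch=False))
print("multi-pass  overlap:", train(reuse_batch=True))
```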

Theoretical Framework

By employing Dynamical Mean-Field Theory (DMFT), the paper characterizes, in the high-dimensional limit, how two-layer networks can efficiently learn a broad class of functions. DMFT tracks the interplay between the evolving network weights and the fixed training batch, an interaction that is difficult to capture with standard tools once data are reused. The crucial finding is the identification of hidden progress during training: even when the weights show no immediate alignment with the target function's relevant subspace, reusing the batch lets them accumulate correlations with the data that, after a further step, translate into alignment with that subspace. This contrasts with the inherent limitations of one-pass algorithms, whose learning can stall because of the "curse" of information and leap exponents.
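
Schematically, the setting and the order parameter tracked by this kind of analysis can be written as follows; the normalizations and which layers are trained follow common conventions in this literature and may differ in detail from the paper's exact choices.

```latex
% Schematic setup and order parameter; conventions are illustrative.
\begin{align}
  y^\mu &= g_*\!\left(\frac{W_* x^\mu}{\sqrt{d}}\right),
        \quad x^\mu \sim \mathcal{N}(0, I_d), \quad \mu = 1, \dots, n, \; n = \alpha d
        && \text{(multi-index target)} \\
  \hat f(x; W, a) &= \frac{1}{p} \sum_{j=1}^{p} a_j\,
        \sigma\!\left(\frac{\langle w_j, x \rangle}{\sqrt{d}}\right)
        && \text{(two-layer student)} \\
  w_j^{t+1} &= w_j^{t} - \eta\, \nabla_{w_j} \frac{1}{n} \sum_{\mu=1}^{n}
        \ell\!\left(\hat f(x^\mu; W^t, a),\, y^\mu\right)
        && \text{(multi-pass GD: the same $n$ samples at every step)} \\
  M^t &= \frac{W^t W_*^{\top}}{d} \in \mathbb{R}^{p \times r}
        && \text{(overlap with the target subspace)}
\end{align}
```

Single-pass GD instead draws a fresh batch of $n$ samples at each step. The DMFT analysis provides a closed-form description of how the low-dimensional projections $M^t$ (together with the self-overlaps of $W^t$) evolve, and it is at this level that the hidden progress appears: with batch reuse, the overlap with the target subspace becomes nonzero after just two steps.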

Empirical Findings

The theoretical insights are supported by numerical experiments that establish a clear dichotomy between single-pass and multi-pass GD. The multi-pass approach learns rapidly even functions that single-pass algorithms cannot learn efficiently with a number of samples proportional to the input dimension. Significant learning occurs with as few as two gradient steps over the same data batch, producing a positive overlap with the target subspace. This contrasts with the sample-complexity barriers that the information and leap exponents impose on single-pass methods.

Implications and Conclusions

The work reshapes our understanding of the role the dataset plays in neural network training. The findings show that, with batch sizes proportional to the input dimension, two-layer neural networks benefit from repeatedly revisiting the same batch and can thereby efficiently learn a broad class of functions in finite time. This challenges the common assumption in theoretical analyses that a fresh batch is needed at every training iteration.

By basing its proofs on DMFT, the paper also exemplifies the usefulness of statistical-physics frameworks for analyzing high-dimensional learning dynamics. Moreover, it addresses how weak recovery of the target directions extends to strong recovery in terms of achieved accuracy, underscoring the practical relevance of the theoretical results. These findings could influence the design of future learning algorithms and lead to more efficient training strategies, especially when access to large datasets is limited.