The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents (2402.03220v3)

Published 5 Feb 2024 in stat.ML and cs.LG

Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.

References (51)
  1. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021.
  2. The merged-staircase property: A necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
  3. SGD learning on neural networks: Leap complexity and saddle-to-saddle dynamics, 2023.
  4. Out-of-equilibrium dynamical mean-field equations for the perceptron model. Journal of Physics A: Mathematical and Theoretical, 51(8):085002, 2018.
  5. G. E. Andrews. Special functions. Cambridge University Press, 2004.
  6. The committee machine: computational to statistical gaps in learning a two-layers neural network. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124023, Dec. 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ab43d2. URL http://dx.doi.org/10.1088/1742-5468/ab43d2.
  7. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37932–37946. Curran Associates, Inc., 2022.
  8. Learning in the presence of low-dimensional structure: a spiked random matrix perspective. In NeurIPS 2023.
  9. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
  10. M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
  11. Symmetric Langevin spin glass dynamics. The Annals of Probability, 25(3):1367–1422, 1997.
  12. Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1–51, 2021.
  13. On learning Gaussian multi-index models with gradient flow. arXiv preprint arXiv:2310.19793, 2023.
  14. E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
  15. Spectrum dependent learning curves in kernel regression and wide neural networks. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1024–1034. PMLR, 13–18 Jul 2020.
  16. Out of equilibrium dynamics in spin-glasses and other glassy systems. Spin glasses and random fields, 12:161, 1998.
  17. The high-dimensional asymptotics of first order methods with random data. arXiv:2112.07572, 2021.
  18. S. Chen and R. Meka. Learning polynomials in few relevant dimensions. In Conference on Learning Theory, pages 1161–1227. PMLR, 2020.
  19. L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
  20. L. F. Cugliandolo. Dynamics of glassy systems. In Slow Relaxations and nonequilibrium dynamics in condensed matter. Springer, 2003.
  21. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10131–10143. Curran Associates, Inc., 2021.
  22. Neural networks can learn representations with gradient descent. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 5413–5452. PMLR, 02–05 Jul 2022.
  23. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. arXiv preprint arXiv:2305.10633, 2023.
  24. How two-layer neural networks learn, one (giant) step at a time, 2023.
  25. Statistical mechanics of support vector networks. Phys. Rev. Lett., 82:2975–2978, Apr 1999. doi: 10.1103/PhysRevLett.82.2975.
  26. H. Eissfeller and M. Opper. New method for studying the dynamics of disordered spin systems without finite-size effects. Physical Review Letters, 68(13):2094, 1992.
  27. H. Eissfeller and M. Opper. Mean-field Monte Carlo approach to the Sherrington-Kirkpatrick model with asymmetric couplings. Physical Review E, 50(2):709, 1994.
  28. Dynamical mean-field theory of strongly correlated fermion systems and the limit of infinite dimensions. Reviews of Modern Physics, 68(1):13, 1996.
  29. Rigorous dynamical mean field theory for stochastic gradient descent methods, 2023.
  30. Limitations of lazy training of two-layers neural network. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  31. When do neural networks outperform kernel methods? In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14820–14830. Curran Associates, Inc., 2020.
  32. Learning curves of generic features maps for realistic datasets with a teacher-student model. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18137–18151. Curran Associates, Inc., 2021.
  33. Phase retrieval in high dimensions: Statistical and computational phase transitions, 2020.
  34. S. S. Mannelli and P. Urbani. Just a momentum: Analytical study of momentum-based acceleration methods in paradigmatic high-dimensional non-convex problems. NeurIPS, 2021.
  35. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. In Advances in Neural Information Processing Systems, pages 8676–8686, 2019a.
  36. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342, 2019b.
  37. Marvels and pitfalls of the Langevin algorithm in noisy high-dimensional inference. Physical Review X, 10(1):011057, 2020.
  38. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  39. F. Mignacco and P. Urbani. The effective noise of stochastic gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, aug 2022. doi: 10.1088/1742-5468/ac841d. URL https://doi.org/10.1088/1742-5468/ac841d.
  40. Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification. Advances in Neural Information Processing Systems, 33:9540–9550, 2020.
  41. Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem. Machine Learning: Science and Technology, 2(3):035029, 2021.
  42. A theory of non-linear feature learning with one gradient step in two-layer neural networks, 2023.
  43. A. Montanari and B. N. Saeed. Universality of empirical risk minimization. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4310–4312. PMLR, 02–05 Jul 2022.
  44. Gradient-based feature learning under structured data, 2023.
  45. G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935, 2022. doi: https://doi.org/10.1002/cpa.22074.
  46. Numerical implementation of dynamical mean field theory for disordered systems: application to the Lotka–Volterra model of ecosystems. Journal of Physics A: Mathematical and Theoretical, 52(48):484001, Nov. 2019. ISSN 1751-8121. doi: 10.1088/1751-8121/ab1f32. URL http://dx.doi.org/10.1088/1751-8121/ab1f32.
  47. D. Saad and S. A. Solla. On-line learning in soft committee machines. Physical Review E, 52(4):4225–4243, Oct. 1995. doi: 10.1103/PhysRevE.52.4225.
  48. J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020.
  49. H. Sompolinsky and A. Zippelius. Dynamic theory of the spin-glass phase. Phys. Rev. Lett., 47:359–362, Aug 1981.
  50. Chaos in random neural networks. Phys. Rev. Lett., 61:259–262, Jul 1988.
  51. A. Zweig and J. Bruna. Symmetric single index learning, 2023.

Summary

  • The paper establishes that multi-pass gradient descent, which reuses the same batch across steps, overcomes the information- and leap-exponent limitations of single-pass methods.
  • It employs Dynamical Mean-Field Theory (DMFT) to characterize the training dynamics in closed form and to reveal hidden progress in the network weights.
  • Theory and experiments indicate that as few as two gradient steps on the same batch yield a nonzero overlap with the target subspace, even for functions that do not satisfy the staircase property.

Introduction

Understanding the training dynamics of neural networks provides valuable insight into their learning capabilities and limitations. This paper investigates the effect of reusing data batches when training two-layer neural networks on multi-index target functions. It challenges the standard theoretical setting in which a fresh batch is drawn at every iteration, demonstrating the advantage of multi-pass gradient descent (multi-pass GD) over single-pass gradient descent.
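
As a rough illustration of the two protocols being compared (not the authors' exact experiment), the sketch below trains a small two-layer network on a synthetic single-index target built from the Hermite polynomial He_3, which has information exponent 3. The single-pass run draws a fresh batch at every step, while the multi-pass run keeps reusing one batch of size proportional to the input dimension; both report the overlap of the first-layer weights with the hidden direction. Dimensions, batch size, step size, and activation are illustrative choices, not the paper's scalings.

```python
# Minimal sketch (not the paper's exact experiment): single-pass vs multi-pass GD
# on a two-layer network learning a single-index target with information exponent 3.
# Dimensions, batch size, step size, and activation are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, steps, lr = 256, 32, 1024, 100, 10.0   # input dim, hidden units, batch size (~4d), GD steps, step size

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)                          # hidden target direction
he3 = lambda z: z ** 3 - 3.0 * z                        # Hermite polynomial He_3 -> information exponent 3
target = lambda X: he3(X @ theta)

def forward(W, a, X):
    H = np.tanh(X @ W.T / np.sqrt(d))                   # (n, p) hidden activations
    return H @ a / p, H

def grad_W(W, a, X, y):
    pred, H = forward(W, a, X)
    err = pred - y                                      # residual of the squared loss
    dpre = (1.0 - H ** 2) * err[:, None] * (a[None, :] / p)   # backprop through tanh
    return dpre.T @ X / (len(y) * np.sqrt(d))

def overlap(W):
    # largest cosine similarity between a first-layer row and the target direction
    return float(np.max(np.abs(W @ theta) / np.linalg.norm(W, axis=1)))

def train(reuse_batch):
    W = rng.standard_normal((p, d))                     # first layer (trained)
    a = rng.choice([-1.0, 1.0], size=p)                 # second layer (kept fixed)
    X = rng.standard_normal((n, d)); y = target(X)
    for _ in range(steps):
        if not reuse_batch:                             # single-pass: draw a fresh batch every step
            X = rng.standard_normal((n, d)); y = target(X)
        W -= lr * grad_W(W, a, X, y)                    # multi-pass: the same batch is reused
    return overlap(W)

print("single-pass overlap:", train(reuse_batch=False))
print("multi-pass  overlap:", train(reuse_batch=True))
```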

Theoretical Framework

By employing Dynamical Mean-Field Theory (DMFT), the paper characterizes, in the high-dimensional limit, how two-layer networks can efficiently learn a broad class of functions. DMFT tracks the interplay between the evolving network weights and the fixed training batch, an interaction that is difficult to capture with standard tools once data are reused. The crucial finding is the identification of hidden progress during training: even when the weights show no immediate alignment with the target function's relevant subspace, reusing the batch lets them accumulate correlations with the data that, after a further step, translate into alignment with that subspace. This contrasts with the inherent limitations of one-pass algorithms, whose learning can stall because of the "curse" of information and leap exponents.
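
Schematically, the setting and the order parameter tracked by this kind of analysis can be written as follows; the normalizations and which layers are trained follow common conventions in this literature and may differ in detail from the paper's exact choices.

```latex
% Schematic setup and order parameter; conventions are illustrative.
\begin{align}
  y^\mu &= g_*\!\left(\frac{W_* x^\mu}{\sqrt{d}}\right),
        \quad x^\mu \sim \mathcal{N}(0, I_d), \quad \mu = 1, \dots, n, \; n = \alpha d
        && \text{(multi-index target)} \\
  \hat f(x; W, a) &= \frac{1}{p} \sum_{j=1}^{p} a_j\,
        \sigma\!\left(\frac{\langle w_j, x \rangle}{\sqrt{d}}\right)
        && \text{(two-layer student)} \\
  w_j^{t+1} &= w_j^{t} - \eta\, \nabla_{w_j} \frac{1}{n} \sum_{\mu=1}^{n}
        \ell\!\left(\hat f(x^\mu; W^t, a),\, y^\mu\right)
        && \text{(multi-pass GD: the same $n$ samples at every step)} \\
  M^t &= \frac{W^t W_*^{\top}}{d} \in \mathbb{R}^{p \times r}
        && \text{(overlap with the target subspace)}
\end{align}
```

Single-pass GD instead draws a fresh batch of $n$ samples at each step. The DMFT analysis provides a closed-form description of how the low-dimensional projections $M^t$ (together with the self-overlaps of $W^t$) evolve, and it is at this level that the hidden progress appears: with batch reuse, the overlap with the target subspace becomes nonzero after just two steps.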

Empirical Findings

The theoretical insights are supported by numerical experiments that establish a clear dichotomy between single-pass and multi-pass GD. The multi-pass approach learns rapidly even functions that single-pass algorithms cannot learn efficiently with a number of samples proportional to the input dimension. Significant learning occurs with as few as two gradient steps over the same data batch, producing a positive overlap with the target subspace. This contrasts with the sample-complexity barriers that the information and leap exponents impose on single-pass methods.

Implications and Conclusions

The work reshapes our understanding of the role the dataset plays in neural network training. The findings show that, with batch sizes proportional to the input dimension, two-layer neural networks benefit from repeatedly revisiting the same batch and can thereby efficiently learn a broad class of functions in finite time. This challenges the common assumption in theoretical analyses that a fresh batch is needed at every training iteration.

By basing its proofs on DMFT, the paper also exemplifies the usefulness of statistical-physics frameworks for analyzing high-dimensional learning dynamics. Moreover, it addresses how weak recovery of the target directions extends to strong recovery in terms of achieved accuracy, underscoring the practical relevance of the theoretical results. These findings could influence the design of future learning algorithms and lead to more efficient training strategies, especially when access to large datasets is limited.