How Two-Layer Neural Networks Learn, One (Giant) Step at a Time (2305.18270v3)

Published 29 May 2023 in stat.ML and cs.LG

Abstract: We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to improvement in the approximation capacity with respect to the initialization. We compare the influence of batch size and that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize to multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. The picture drastically improves over multiple gradient steps: we show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allows to drastically improve the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime, and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction where learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on how neural networks adapt to features of the data.

Citations (13)

Summary

  • The paper reveals that batch size thresholds critically enable two-layer networks to learn essential feature directions in early training.
  • It employs both single-step and iterative gradient descent analyses to track progressive subspace alignment and effective feature learning.
  • The study shows that appropriately scaling batch sizes enhances neuron specialization, accelerating multi-index function learning in high dimensions.

Analysis of Feature Learning in Two-Layer Neural Networks

The paper examines the dynamics by which two-layer neural networks adapt to target functions during the early phase of training. Focusing on the network's response to its first few batches of training data, the authors trace how feature learning emerges and improves approximation capacity after only a few gradient descent steps. The analysis pins down the batch-size and iteration thresholds at which learning occurs, showing how the network picks up the structural directions of the target and exploits them in its data representation.

Context and Objectives

Neural networks are celebrated for their versatility, yet their adaptability, that is, how their features change to better represent structured data, is less understood. Unlike classical models such as linear regression, whose loss landscapes are simple and well characterized, neural networks must navigate a far more involved parameter space. The paper therefore studies these dynamics for two-layer networks trained on structured target functions: multi-index functions that depend on only a few relevant directions of a high-dimensional input.

Framework and Results

The target function f^\star depends on the input only through a few orthonormal teacher vectors. The authors compare the effect of full-batch gradient descent across batch-size scalings (n = \mathcal{O}(d^k)) and numbers of training steps. Their analyses reveal critical insights (a minimal code sketch follows the list):

  1. Single-Step Gradient Descent:
    • Necessary and sufficient conditions: As stated in Theorem 1, a batch size of n = \mathcal{O}(d) is both necessary and sufficient for the first-layer weights to align with the target, but at this scale only a single direction can be learned. Learning multiple index directions in one step therefore requires larger batches.
    • Higher-order scalings: As detailed in Theorem 2, a batch size of n = \mathcal{O}(d^2) is needed for neurons to specialize to multiple relevant directions within a single step. Even then, there may exist "hard" directions requiring n = \mathcal{O}(d^\ell) samples, where \ell is the leap index of the target; the leap index thus quantifies the sample-complexity bottleneck for each direction.
  2. Iterative Improvements Across Steps:
    • Progress is traced over T steps of gradient descent with a fixed batch size scaling linearly in the dimension, n = \mathcal{O}(d). For targets satisfying a staircase property, the first-layer weights progressively align with the span of the teacher vectors U^{\star}, with more directions learned as the number of steps grows.
    • Each step conditions on the subspace already learned: the authors' projection-based conditioning argument shows how previously learned directions make new ones accessible at the next step.
  3. Interconnections and Extensions:
    • To separate the lazy (random-features) regime from genuine feature learning, the authors combine Hermite polynomial expansions of the activation and target with Gaussian equivalence arguments, tracking the dynamics in feature space and quantifying how much of the target's relevant subspace the trained features capture.
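To make the single giant-step setting concrete, the following minimal numpy sketch trains only the first layer of a two-layer ReLU network with one large full-batch gradient step on a single-index target and measures how the neurons align with the teacher direction. The activation, link function, learning-rate scale, and dimensions here are illustrative assumptions, not the paper's exact experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

d, p = 256, 128            # input dimension, number of hidden neurons
n = 8 * d                  # batch size n = O(d): enough to learn one direction
eta = float(d)             # "giant" learning rate scaling with the dimension

u_star = rng.normal(size=d)
u_star /= np.linalg.norm(u_star)            # teacher direction
link = lambda z: z + z**2 / np.sqrt(2)      # illustrative single-index link function

X = rng.normal(size=(n, d))                 # Gaussian input batch
y = link(X @ u_star)

W = rng.normal(size=(p, d)) / np.sqrt(d)    # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=p) / p     # fixed second-layer weights

relu = lambda z: np.maximum(z, 0.0)
err = relu(X @ W.T) @ a - y                 # residuals of the squared loss on the batch

# One full-batch gradient step on the first layer only (second layer frozen)
grad_W = ((err[:, None] * (X @ W.T > 0)) * a).T @ X / n
W1 = W - eta * grad_W

# Cosine alignment of each neuron with the teacher direction, before and after
align = lambda M: np.abs(M @ u_star) / np.linalg.norm(M, axis=1)
print(f"mean |cos(w, u*)| at init:      {align(W).mean():.3f}")
print(f"mean |cos(w, u*)| after 1 step: {align(W1).mean():.3f}")
```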

Implications and Theoretical Contributions

The findings clarify how design choices in two-layer networks shape early feature learning. Efficiency depends on neuron specialization and on the learning-rate schedule, and most critically on the batch size, which determines which target directions are learnable in the first steps.

The contrast between batch-size regimes also has practical implications: multi-index functions can be learned in fewer training rounds when batches are appropriately scaled, while in the linear-batch regime additional steps progressively unlock new directions, as sketched below. Compared with small-batch stochastic gradient descent, this clarifies the cascade by which iterative large-batch learning improves approximation.
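As a rough numerical illustration of this staircase behaviour (again under illustrative assumptions about the activation, link function, and hyperparameters, rather than the paper's protocol), the sketch below runs several full-batch steps at batch size n = \mathcal{O}(d) on the staircase target y = z_1 + z_1 z_2 and tracks the overlap of the first-layer weights with each teacher direction; the direction that appears only in the interaction term is expected to be picked up after the first one.

```python
import numpy as np

rng = np.random.default_rng(1)

d, p, T = 256, 128, 5
n = 8 * d                                   # linear batch size n = O(d)
eta = float(d)                              # large (giant-step) learning rate

# Two orthonormal teacher directions and a staircase link: y = z1 + z1*z2
U_star, _ = np.linalg.qr(rng.normal(size=(d, 2)))
target = lambda Z: Z[:, 0] + Z[:, 0] * Z[:, 1]

W = rng.normal(size=(p, d)) / np.sqrt(d)    # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=p) / p     # fixed second-layer weights
relu = lambda z: np.maximum(z, 0.0)

for t in range(1, T + 1):
    X = rng.normal(size=(n, d))             # fresh batch at each step
    y = target(X @ U_star)
    err = relu(X @ W.T) @ a - y
    grad_W = ((err[:, None] * (X @ W.T > 0)) * a).T @ X / n
    W = W - eta * grad_W
    # overlap of each neuron with each of the two teacher directions
    ov = np.abs(W @ U_star) / np.linalg.norm(W, axis=1, keepdims=True)
    print(f"step {t}: mean overlap with dir 1 = {ov[:, 0].mean():.3f}, "
          f"with dir 2 = {ov[:, 1].mean():.3f}")
```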

Furthermore, the paper's conjectures and numerical experiments highlight how over-parameterization refines network performance, enabling networks to surpass kernel-method baselines and to specialize beyond their random-initialization features. The technical tools, in particular the concentration, projection-based conditioning, and Gaussian equivalence arguments, are of independent interest for the analysis of two-layer architectures.

Conclusion

The paper shows how two-layer neural networks in high-dimensional settings adapt their features to the target during the earliest phase of training, and quantifies the resulting gains in approximation and generalization over the initialization. By organizing learnability around batch size, number of steps, and the leap index of the target, it provides a clear taxonomy of learning regimes and a foundation for further study of feature learning in larger models.
