- The paper provides an exact asymptotic characterization of test error after one large gradient descent step, marking a shift from lazy training to effective feature learning.
- The paper introduces a spiked Random Features model that maps the trained network onto a conditional Gaussian framework, clarifying how a single gradient step transforms the features non-linearly.
- The paper shows that a large learning rate drives feature specialization, lowering the sample complexity required to learn non-linear targets compared to traditional kernel methods.
Asymptotics of Feature Learning in Two-Layer Networks after One Gradient-Step
The paper "Asymptotics of Feature Learning in Two-Layer Networks after One Gradient-Step" explores the initial stages of learning in two-layer neural networks, specifically focusing on the transformations that occur following a single gradient descent step. The paper presented provides a comprehensive asymptotic analysis of the generalization error within a high-dimensional setting where the number of samples n, the width p, and the input dimension d scale proportionally. The authors leverage a non-linear spiked matrix model analysis and recent advances in Gaussian universality to detail the performance improvements beyond traditional kernel methods when feature learning is introduced through a substantial learning rate.
Key Contributions and Findings
- Exact Asymptotics for the Generalization Error: The paper provides an exact asymptotic characterization of the test error after the first-layer weights undergo a single, large gradient step, with the learning rate scaling as η = Θ_d(√d). This characterization quantifies the benefit of feature learning over the lazy kernel regime: whereas at initialization the network effectively captures only the linear component of the target, after one large step it also learns non-linear components of the target aligned with the direction picked up by the gradient.
- Spiked Random Features (sRF) Model: The authors establish an equivalence between the trained two-layer network and a spiked Random Features model, whose weight matrix is the sum of a bulk matrix and a rank-one spike. This is an important theoretical step toward understanding how the first training step transforms the network's feature representation. The paper derives the parameters of the sRF model, such as the bulk variance and the spike strength, as functions of the gradient update (a numerical illustration of the bulk-plus-spike decomposition is sketched after this list).
- Conditional Gaussian Equivalence: Building on prior Gaussian-equivalence results, the authors prove a conditional Gaussian equivalence for the spiked Random Features model: asymptotically, once one conditions on the projection of the input along the spike direction, the learning properties of the sRF model coincide with those of an equivalent Gaussian model, which makes the learning dynamics tractable to analyze (the standard, unconditional form of this equivalence is recalled after this list).
- Improvement over Kernel Methods: A key insight is that the one-step network generalizes better than kernel methods because it learns non-linear components of the target efficiently. In this proportional regime, kernel methods capture essentially only the target's linear component; feature learning lets the network adapt to the data beyond that linear approximation, which markedly improves performance when samples are limited.
- Further Insights and Boundary Cases: The discussion contrasts plain random features with their spiked counterparts, highlighting the difference in what each can learn. Even a weak spike brings a qualitative benefit, lifting the sample-complexity limitations associated with traditional kernel approaches.
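As referenced in the sRF bullet above, the following continuation of the earlier sketch (reusing `eta`, `grad_W`, `W0`, `theta`, and the helpers defined there) illustrates the bulk-plus-spike picture numerically: truncate the gradient update to its leading singular direction, keep only that rank-one spike on top of the initialization, and refit the readout. The SVD truncation is an illustration of the decomposition under the sketch's assumptions, not the paper's derivation of the sRF parameters.

```python
# Continuing the sketch above: the single large gradient step is dominated by
# its leading singular direction, so the trained weights are close to
# "bulk + rank-one spike", W1 ≈ W0 + spike.
U, S, Vt = np.linalg.svd(eta * grad_W, full_matrices=False)
W_spiked = W0 - S[0] * np.outer(U[:, 0], Vt[0])    # bulk W0 plus the rank-one spike

# A readout refitted on the spiked features should track the one trained on the
# fully updated features.
print("spiked-RF test MSE:  ", test_mse(W_spiked, ridge_readout(W_spiked)))

# Feature specialization: the spike's right singular vector aligns with the
# target direction theta.
print("spike/target overlap:", abs(Vt[0] @ theta) / np.linalg.norm(theta))
```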
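For background on the Gaussian-equivalence step mentioned above: for plain (unspiked) random features, the standard result in this literature states that, for the purpose of learning a linear readout, the non-linear features can be replaced by Gaussian-equivalent ones. Schematically, with ξ ~ N(0,1) and constants depending only on the activation σ (recalled here as context, not in the paper's exact notation):

$$
\mu_0=\mathbb{E}[\sigma(\xi)],\qquad \mu_1=\mathbb{E}[\xi\,\sigma(\xi)],\qquad \mu_\star^2=\mathbb{E}[\sigma(\xi)^2]-\mu_0^2-\mu_1^2,
$$

$$
\sigma\!\left(\frac{W_0 x}{\sqrt{d}}\right)\;\longleftrightarrow\;\mu_0\,\mathbf{1}_p+\mu_1\,\frac{W_0 x}{\sqrt{d}}+\mu_\star\, z,\qquad z\sim\mathcal{N}(0,I_p)\ \text{independent of }x.
$$

The paper's conditional version additionally retains the non-linear dependence of the spiked features on the projection of the input along the spike direction, which is precisely where the advantage over the kernel regime comes from.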
Implications and Future Directions
The implications of this research are significant both theoretically and practically. From a theoretical perspective, it sharpens our understanding of neural network initialization and of how quickly representational power grows in the earliest stage of training. Practically, it suggests how initialization and learning-rate choices can be tuned to exploit feature learning effectively.
Future studies could analyze multiple gradient steps and extend the characterization to targets with non-zero initial alignment to the learned direction. Moreover, investigating networks of greater depth could further elucidate the interplay between gradient dynamics, feature specialization, and overall learning efficiency.
The findings and methodologies outlined provide a robust framework for interpreting the initial learning phases in neural architectures, paving the way for more refined training paradigms that capitalize on feature learning breakthroughs revealed in this asymptotic analysis.