- The paper analyzes gradient descent (GD) dynamics in the Edge-of-Stability (EoS) regime using a minimalist two-layer linear network with bivariate input, covering pre-EoS sharpening, progressive sharpening, and self-stabilization.
- It shows that GD converges to a global minimum with sharpness bounded by 2/η, with the relevant input feature driving the loss decrease while the irrelevant feature causes oscillations near the 2/η threshold.
- The findings explain EoS via a constrained-trajectory view, align with empirical observations, and connect the GD dynamics to gradient flow solutions.
A Non-asymptotic Analysis of Gradient Descent Dynamics: Beyond Edge-of-Stability
This paper provides a rigorous analysis of the Edge-of-Stability (EoS) phenomenon in deep learning optimization, focusing on training with large learning rates. The authors study a minimalist model: a two-layer linear neural network with a two-dimensional input, in which one input dimension is relevant to the output and the other is irrelevant. This bivariate setting goes beyond previous investigations that were limited to scalar networks.
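As one concrete instantiation of such a setting (a sketch under assumptions of scalar output, squared loss, and whitened inputs; the paper's exact parameterization may differ), write the input as x = (x_1, x_2) with target y = c·x_1, so that only the first coordinate carries signal:

```latex
f(x; w, v) = v\,(w^\top x), \qquad w \in \mathbb{R}^2,\; v \in \mathbb{R},
\qquad
L(w, v) = \tfrac{1}{2}\,\mathbb{E}\big[(f(x; w, v) - c\,x_1)^2\big]
        = \tfrac{1}{2}\,(v w_1 - c)^2 + \tfrac{1}{2}\, v^2 w_2^2 .
```

Under these assumptions the relevant weight w_1 must fit the target, while the irrelevant weight w_2 only needs to be driven to zero, mirroring the relevant/irrelevant split the paper analyzes.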
The major contribution lies in characterizing the dynamics of gradient descent (GD) across three distinct phases: pre-EoS sharpening, progressive sharpening during EoS, and self-stabilization during EoS. The authors show that the GD trajectory ultimately converges to a global minimum with sharpness bounded by 2/η, even though the loss does not decrease monotonically. Specifically, they demonstrate that the relevant feature drives the loss reduction, whereas the irrelevant feature contributes primarily to the oscillatory dynamics.
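To make this concrete, here is a minimal numerical sketch (not the paper's model or code; the constant c, the initialization, and the learning rate are arbitrary choices) that runs GD on the toy loss written above and tracks the loss together with the sharpness, i.e. the top Hessian eigenvalue, against the 2/η threshold. With the default η, 2/η ≈ 3.6 lies below the sharpness (≈ 2c = 4) of the flattest global minimum of this particular toy, so the run is expected to settle into bounded EoS-style oscillations rather than converge exactly; a smaller η gives plain convergence with sharpness staying below 2/η.

```python
# Toy sketch (not the paper's code): GD on
#   L(w1, w2, v) = 0.5*(v*w1 - c)**2 + 0.5*(v*w2)**2,
# the whitened-input population loss sketched above, tracking
# loss and sharpness (top Hessian eigenvalue) against 2/eta.
import numpy as np

c = 2.0     # target coefficient on the relevant coordinate (arbitrary)
eta = 0.55  # learning rate; 2/eta ~ 3.6 < 2c = 4, so EoS oscillation is expected late on

def loss(theta):
    w1, w2, v = theta
    return 0.5 * (v * w1 - c) ** 2 + 0.5 * (v * w2) ** 2

def grad(theta):
    w1, w2, v = theta
    r = v * w1 - c                           # residual on the relevant feature
    return np.array([v * r,                  # dL/dw1
                     v * v * w2,             # dL/dw2
                     w1 * r + v * w2 ** 2])  # dL/dv

def sharpness(theta, eps=1e-5):
    """Top Hessian eigenvalue, via central differences of the gradient."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (grad(theta + e) - grad(theta - e)) / (2 * eps)
    H = 0.5 * (H + H.T)                      # symmetrize numerical noise
    return np.linalg.eigvalsh(H)[-1]

theta = np.array([0.3, 0.2, 0.3])            # small initialization (arbitrary)
for t in range(601):
    if t % 100 == 0:
        print(f"t={t:4d}  loss={loss(theta):8.5f}  "
              f"S(theta)={sharpness(theta):6.3f}  2/eta={2 / eta:6.3f}")
    theta -= eta * grad(theta)
```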
A central result is a proof that the sharpness S(θ), defined as the largest eigenvalue of the loss Hessian, fluctuates around 2/η during the EoS stage, effectively confining the GD trajectory to a stable region. This matches empirically observed behavior in deep learning practice, where the sharpness repeatedly oscillates near 2/η.
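The 2/η threshold itself comes from the classical stability condition for GD on a one-dimensional quadratic L(θ) = ½λθ², a standard argument not specific to this paper: along a Hessian eigendirection with curvature λ, the GD update contracts only if |1 − ηλ| ≤ 1,

```latex
\theta_{t+1} = \theta_t - \eta\,\lambda\,\theta_t = (1 - \eta\lambda)\,\theta_t
\quad\Longrightarrow\quad
|1 - \eta\lambda| \le 1 \;\Longleftrightarrow\; \lambda \le \frac{2}{\eta},
```

so curvature above 2/η makes the corresponding component oscillate with growing amplitude, which is precisely the instability that the self-stabilization phase must counteract.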
Moreover, the paper identifies the constrained-trajectory framework as a plausible explanation for the EoS dynamics, echoing the behavior of projected gradient descent under a sharpness constraint. Intriguingly, the authors connect their findings to existing work by studying the gradient flow solutions (GFS) and showing that the GFS sharpness decreases monotonically along the GD trajectory.
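The constrained-trajectory view admits a compact schematic formulation (a sketch of the idea rather than the paper's exact statement): during EoS, GD behaves approximately like projected gradient descent onto the region where the sharpness does not exceed the stability threshold,

```latex
\theta_{t+1} \approx \Pi_{\mathcal{K}}\big(\theta_t - \eta\,\nabla L(\theta_t)\big),
\qquad
\mathcal{K} = \{\theta : S(\theta) \le 2/\eta\},
```

where Π_K denotes Euclidean projection onto K and the approximation sign reflects that this is an interpretation of the dynamics, not the literal update rule.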
Theoretically, these insights advance the understanding of how GD behaves with large learning rates and in the EoS regime. Practically, they can inform optimization strategies that accommodate large learning rates without compromising stability, and they may stimulate further research into robust training methodologies and adaptive learning-rate schemes, especially for complex neural architectures.
In conclusion, this paper advances the understanding of deep learning optimization by presenting a novel analysis of GD dynamics in the EoS regime and connecting theoretical findings with empirical observations in a more realistic setting. As the field progresses, large learning rates will continue to demand attention for their dual role in accelerating training and triggering instability, underscoring the relevance and utility of this foundational work.