
A Minimalist Example of Edge-of-Stability and Progressive Sharpening (2503.02809v1)

Published 4 Mar 2025 in cs.LG

Abstract: Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved "stable set" between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.

Summary

  • The paper analyzes Gradient Descent dynamics during Edge-of-Stability using a minimalist two-layer linear network with bivariate input, exploring pre-EoS, progressive sharpening, and self-stabilization.
  • It shows Gradient Descent converges to global minima with sharpness bounded by 2/η, driven by relevant features while irrelevant ones cause oscillations near 2/η.
  • The findings explain Edge-of-Stability via a constrained trajectory view, align with empirical observations, and connect the dynamics to gradient flow solutions.

A Non-asymptotic Analysis of Gradient Descent Dynamics: Beyond Edge-of-Stability

This paper provides a rigorous exploration of the Edge-of-Stability (EoS) phenomenon in deep learning optimization, focusing on scenarios involving large learning rates. To achieve this, the authors employ a minimalist model comprising a two-layer linear neural network with a two-dimensional input, wherein one dimension is relevant to the output and the other is irrelevant. The paper thus goes beyond previous investigations limited to scalar networks by considering a more realistic bivariate input setting.
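To make this setup concrete, here is a minimal Python sketch of such a model (an illustration under simplifying assumptions, not the authors' exact parametrization): a two-layer linear network f(x) = a · (w⊤x) with a scalar output weight a and w ∈ R², trained on synthetic data whose response depends only on the first input coordinate. The function names and data model below are hypothetical.

```python
import numpy as np

def loss(params, X, y):
    """Squared loss of a two-layer linear net f(x) = a * (w @ x)."""
    a, w = params[0], params[1:]
    return 0.5 * np.mean((a * (X @ w) - y) ** 2)

def grad(params, X, y):
    """Analytic gradient of the loss above with respect to (a, w1, w2)."""
    a, w = params[0], params[1:]
    r = a * (X @ w) - y                      # residuals
    da = np.mean(r * (X @ w))                # dL/da
    dw = a * (X.T @ r) / len(y)              # dL/dw
    return np.concatenate([[da], dw])

# Toy data: x = (relevant, irrelevant); the response depends only on x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = 1.5 * X[:, 0]
```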

The major contribution of the paper lies in establishing the dynamics of gradient descent (GD) across three distinct phases: pre-EoS sharpening, progressive sharpening during EoS, and self-stabilization during EoS. The authors reveal that the GD trajectory ultimately converges to global minima, with sharpness bounded by 2/η, despite non-monotonic loss descent. Specifically, they demonstrate that the relevant feature drives the loss reduction, whereas the irrelevant feature contributes primarily to oscillatory dynamics.
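Continuing the toy example above, a plain gradient descent loop at a deliberately large step size η illustrates the kind of trajectory being analyzed. Whether and when the three phases appear in this sketch depends on the step size, data, and initialization, so treat it as an illustration rather than a reproduction of the paper's regime.

```python
def run_gd(params0, X, y, eta, steps):
    """Plain gradient descent on the toy loss; records the full trajectory."""
    params = params0.copy()
    traj, losses = [params.copy()], [loss(params, X, y)]
    for _ in range(steps):
        params = params - eta * grad(params, X, y)
        traj.append(params.copy())
        losses.append(loss(params, X, y))
    return np.array(traj), np.array(losses)

# A large step size, chosen so that the flattest global minimum of the toy loss
# may not be stable; the loss can then descend non-monotonically, as at EoS.
eta = 0.8
traj, losses = run_gd(np.array([0.5, 0.5, 0.5]), X, y, eta=eta, steps=300)
```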

A central aspect of the results includes proving that the sharpness S(θ), measured by the largest eigenvalue of the Hessian matrix, fluctuates around 2/η during the EoS stage, effectively encapsulating the GD trajectory within a stable region. This discovery aligns well with empirically observed phenomena in deep learning practice, where sharpness repeatedly oscillates near 2/η.
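This quantity can be tracked numerically along the trajectory from the previous sketch. The finite-difference Hessian below is merely one way to observe S(θ) in the toy example, not the paper's proof technique.

```python
def sharpness(params, X, y, eps=1e-4):
    """Largest Hessian eigenvalue S(theta), via central differences on grad()."""
    d = len(params)
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (grad(params + e, X, y) - grad(params - e, X, y)) / (2 * eps)
    H = 0.5 * (H + H.T)                      # symmetrize numerical noise
    return np.linalg.eigvalsh(H).max()

# During EoS one expects S(theta_t) to hover around the threshold 2 / eta.
S = np.array([sharpness(p, X, y) for p in traj])
print(f"2/eta = {2 / eta:.2f}, late-stage sharpness ~ {S[-100:].mean():.2f}")
```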

Moreover, the paper identifies the constrained trajectory framework as a plausible explanation for the EoS dynamics, echoing the behavior of projected gradient descent under sharpness constraints. Intriguingly, the authors connect their findings to existing works by analyzing gradient flow solutions (GFS) and showing that the GFS sharpness decreases monotonically along the GD trajectory.
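As a loose illustration of the constrained-trajectory intuition (not the paper's construction), one can compare a plain GD step with a crude "projected" step that backtracks until the sharpness constraint S(θ) ≤ 2/η approximately holds. The helper below reuses grad and sharpness from the earlier sketches and is purely hypothetical.

```python
def constrained_gd_step(params, X, y, eta):
    """One GD step, then backtrack toward the previous iterate until the
    sharpness constraint S(theta) <= 2/eta is (approximately) satisfied.
    A crude stand-in for projection onto the stable set, for intuition only."""
    step = -eta * grad(params, X, y)
    t = 1.0
    while t > 1e-3 and sharpness(params + t * step, X, y) > 2 / eta:
        t *= 0.5                             # shrink the step until feasible
    return params + t * step
```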

From a theoretical perspective, these insights advance the understanding of how GD behaves at large learning rates and in the EoS regime. Practical implications include informing optimization strategies that better accommodate large learning rates without compromising stability. These findings could stimulate further research into robust training methodologies and adaptive learning rate schemes, especially for complex neural architectures.

In conclusion, the paper adds to the understanding of deep learning optimization, presenting a novel analysis of GD dynamics in the EoS regime and connecting theoretical findings with empirical observations in realistic scenarios. As the field progresses, large learning rates will continue to demand attention for their dual role in accelerating training and triggering instability, underscoring the relevance and utility of this foundational work.
