Edge of Stability (EoS) in Optimization

Updated 30 June 2025
  • Edge of Stability (EoS) is a regime where gradient descent converges despite using step sizes beyond the classical 2/L threshold.
  • Empirical and theoretical analyses in overparameterized diagonal linear networks reveal oscillatory dynamics that still lead to zero risk.
  • EoS alters the implicit bias, yielding denser solutions than the sparsity-inducing, approximately minimum-$\ell_1$-norm bias observed with small learning rates.

The Edge of Stability (EoS) is a regime in gradient-based optimization where the dynamics of parameter updates defy classical stability guarantees yet still converge, often with notable implications for the optimization trajectory and the model's implicit bias. In the setting studied in the paper, linear regression with quadratic loss under a diagonal linear network (DLN) parameterization, EoS emerges as a precise and instructive phenomenon, illustrating how non-classical dynamics can arise even from a paradigmatic convex problem once it is suitably overparameterized.

1. Definition and Characterization of Edge of Stability

The Edge of Stability refers to the dynamical regime of gradient descent (GD) in which the step size $\eta$ exceeds the canonical stability threshold $2/L$, where $L$ is the smoothness constant of the loss (the Lipschitz constant of its gradient). In this regime, the maximum eigenvalue of the loss Hessian, or sharpness, $S_t = \lambda_{\max}(\nabla^2 \mathcal{L}(\bm{w}_t))$, can regularly surpass $2/\eta$ during training, violating the step-size condition of the Descent Lemma, the classical guarantee of monotonic loss decrease.

For an objective $\mathcal{L}$ that is $L$-smooth, classical theory guarantees convergence only for $\eta < 2/L$. EoS describes empirically observed cases where, for $\eta > 2/L$, GD still reaches a solution despite transient oscillations in the loss or sharpness, and without strictly monotonic descent.
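
In its standard form, the Descent Lemma states that for an $L$-smooth objective and step size $\eta$,

$\mathcal{L}(\bm{w}_{t+1}) \leq \mathcal{L}(\bm{w}_t) - \eta\left(1 - \tfrac{\eta L}{2}\right)\|\nabla \mathcal{L}(\bm{w}_t)\|^2,$

so a guaranteed decrease requires $\eta < 2/L$; beyond this threshold the lemma provides no control over the iterates.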

In this context, EoS is formally present if, for some iterate $t$,

$S_t > \frac{2}{\eta}.$

This occurs consistently in the experiments and theoretical analysis of the referenced work for overparameterized DLNs with quadratic loss.
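
As a concrete illustration, the following sketch (assuming the one-sample DLN setting; all values and variable names are illustrative, not code from the paper) computes the sharpness $S_t$ from the closed-form Hessian of the quadratic DLN loss and tests the EoS condition $S_t > 2/\eta$:

```python
import numpy as np

def dln_loss_hessian(w_plus, w_minus, X, y):
    """Hessian of L(w+, w-) = (<X, w+^2 - w-^2> - y)^2 / 4 with respect to (w+, w-)."""
    r = X @ (w_plus**2 - w_minus**2) - y                      # residual
    g = np.concatenate([2 * X * w_plus, -2 * X * w_minus])    # gradient of the residual
    # Since L = r^2 / 4:  Hessian = 0.5 * g g^T + r * diag([X, -X])
    return 0.5 * np.outer(g, g) + r * np.diag(np.concatenate([X, -X]))

def sharpness(w_plus, w_minus, X, y):
    """Largest Hessian eigenvalue S_t at the current iterate."""
    return np.linalg.eigvalsh(dln_loss_hessian(w_plus, w_minus, X, y)).max()

# Illustrative one-sample problem in d = 2 with a deliberately large step size.
rng = np.random.default_rng(0)
X, y, eta = np.array([1.0, -0.5]), 1.0, 1.5
w_plus, w_minus = 1.0 + 0.1 * rng.standard_normal(2), 0.1 * rng.standard_normal(2)

S_t = sharpness(w_plus, w_minus, X, y)
print(f"S_t = {S_t:.3f}, 2/eta = {2/eta:.3f}, EoS condition met: {S_t > 2/eta}")
```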

2. Main Results and Divergence from Previous Beliefs

Contrary to prior literature, which suggested that subquadratic loss functions (those with $l''(a) \to 0$ as $|a| \to \infty$) are necessary for EoS to appear, this paper establishes, both empirically and theoretically, that EoS manifests even for the canonical quadratic loss $l(a) = a^2/4$, given an appropriate parameterization.

Diagonal linear networks, parameterized such that $\bm{\beta} = \bm{w}_+^2 - \bm{w}_-^2$ (each component of the predictor is a difference of squares of parameter entries), imbue the optimization landscape with higher-order structure. Under this parameterization, combined with overparameterization ($d > n$ for $n$ samples in $d$ dimensions) and non-degenerate data, GD step sizes beyond the classical threshold do not necessarily cause divergence. Instead, the algorithm can still converge to zero risk, provided suitable conditions on the initialization and the data are met.

The critical finding is that EoS can emerge not because of special loss tail behavior, but due to the structure of the parameterization—even if the loss is a strictly convex quadratic.
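
One way to see the higher-order structure introduced by the parameterization is numerically: the same quadratic loss, expressed in the DLN parameters, grows quartically along rays in parameter space. The sketch below (a toy illustration with assumed values, not taken from the paper) makes this concrete.

```python
import numpy as np

# Toy illustration: the loss is quadratic in beta but quartic in the DLN
# parameters (w+, w-), since beta = w+^2 - w-^2 enters the squared residual.
X, y = np.array([1.0, 0.5]), 1.0

def loss_in_beta(beta):
    return 0.25 * (X @ beta - y) ** 2          # l(a) = a^2 / 4

def loss_in_w(w_plus, w_minus):
    return loss_in_beta(w_plus**2 - w_minus**2)

direction = np.ones(2)
for t in (2.0, 4.0, 8.0):
    print(f"t = {t}: loss in beta (~t^2): {loss_in_beta(t * direction):9.2f}   "
          f"loss in w (~t^4): {loss_in_w(t * direction, np.zeros(2)):10.2f}")
```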

3. Empirical and Theoretical Evidence for EoS

The paper presents a suite of empirical results, aligned with non-asymptotic theoretical analysis, for both the one-sample and multi-sample regression cases:

  • Empirical observations: For the quadratic loss with DLN parameterization, GD with $\eta > 2/L$ exhibits oscillatory yet convergent dynamics in the interpolating regime ($d \ge 2$ in the one-sample case), and does not converge when $d = 1$ or the target is degenerate, consistent with the theoretical predictions.
  • Theoretical analysis: Dynamical-systems and bifurcation-theory arguments, tailored to time-varying coefficients, show that in both the $\mu\eta < 1$ and $\mu\eta > 1$ regimes ($\mu$ is a data-dependent scale parameter), GD converges non-asymptotically to an interpolating solution despite transient instability or loss oscillation. For instance, for a one-sample, two-feature regression, one obtains the exponential bound

    $|\langle \bm{\beta}_t - \bm{\beta}_\infty, X \rangle| \leq C \cdot \exp\left[ - \Theta(\mu\eta) (t - t_0) \right] |\langle \bm{\beta}_{t_0} - \bm{\beta}_\infty, X \rangle|,$

    for $\mu\eta < 1$ (with an analogous bound for $\mu\eta > 1$), capturing the transition across EoS.

In this setting, EoS appears if and only if the model is overparameterized and the data are non-degenerate; with insufficient overparameterization, the large-step-size dynamics diverge rather than enter EoS.
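
The following minimal harness (an illustrative sketch, not the paper's code; the data, initialization scale, and step sizes are assumed values) runs GD on the one-sample DLN objective so that the loss trajectory can be inspected for the monotone, oscillatory-but-convergent, or divergent behavior described above:

```python
import numpy as np

def run_gd(X, y, eta, alpha=0.1, steps=2000, seed=0):
    """Run GD on L(w+, w-) = (<X, w+^2 - w-^2> - y)^2 / 4 and record the loss."""
    rng = np.random.default_rng(seed)
    w_plus = alpha * np.abs(rng.standard_normal(X.shape))    # small positive init
    w_minus = alpha * np.abs(rng.standard_normal(X.shape))
    losses = []
    for _ in range(steps):
        beta = w_plus**2 - w_minus**2
        r = X @ beta - y                       # residual
        loss = 0.25 * r**2
        losses.append(loss)
        if not np.isfinite(loss) or loss > 1e12:
            break                              # stop early if the run blows up
        grad_plus = r * X * w_plus             # dL/dw+
        grad_minus = -r * X * w_minus          # dL/dw-
        w_plus = w_plus - eta * grad_plus
        w_minus = w_minus - eta * grad_minus
    return np.array(losses), w_plus**2 - w_minus**2

# One sample in d = 2 (interpolating regime); compare a small and a large step size.
X, y = np.array([1.0, 0.5]), 1.0
for eta in (0.1, 2.0):
    losses, beta = run_gd(X, y, eta)
    print(f"eta = {eta}: final loss = {losses[-1]:.3e}, "
          f"monotone decrease = {bool(np.all(np.diff(losses) <= 0))}, beta = {beta}")
```

The printed summaries can then be compared against the qualitative distinction drawn above.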

4. Implicit Bias of Diagonal Linear Networks at EoS

The implicit bias of a solution refers to which interpolating minimum GD selects among the many available in an overparameterized setting. For small learning rates, or under continuous-time gradient flow, DLNs recover (approximately) the minimum-$\ell_1$-norm interpolator, favoring sparsity.

The paper shows that, at EoS ($\mu\eta > 1$), this bias can be altered:

  • In the regime $\mu\eta < 1$, the standard bias is retained, and the solution error vanishes as the initialization scale is reduced.
  • In the EoS regime, the residual error and the deviation from the classical bias remain bounded away from zero as the initialization shrinks:

    $\|\bm{\beta}_\infty - \bm{\beta}^*\| \geq \mathcal{C}\,(\mu\eta - 1),$

    with $\mathcal{C}$ a constant depending on the step size and the data. Thus, large step sizes can steer the solution away from the classical bias, leading to denser solutions and potentially reduced sparsity or interpretability.
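
As a small illustration of the reference point in this bound, the sketch below computes the minimum-$\ell_1$-norm interpolator for a one-sample toy problem; taking this as the sparse target $\bm{\beta}^*$ is an assumption made here for illustration. The $\bm{\beta}$ returned by a GD run (for example, the harness sketched in Section 3) can then be compared against it.

```python
import numpy as np

def min_l1_interpolator(X, y):
    """Minimum-l1-norm solution of the single constraint <X, beta> = y.

    For one linear equation, all l1 mass goes on the coordinate of X with the
    largest absolute value (a standard fact, used here only for illustration).
    """
    j = np.argmax(np.abs(X))
    beta_star = np.zeros_like(X)
    beta_star[j] = y / X[j]
    return beta_star

# Illustrative data (assumed, not from the paper).
X, y = np.array([1.0, 0.5]), 1.0
beta_star = min_l1_interpolator(X, y)
print("beta* =", beta_star, " ||beta*||_1 =", np.abs(beta_star).sum())

# Given a beta_inf from a GD run, inspect ||beta_inf - beta_star|| and compare
# l1 norms or support sizes to gauge how dense the recovered solution is.
```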

5. Mathematical Formulations and Regimes

Key quantities and dynamics include:

  • The quadratic parameterized regression loss,

    $\mathcal{L}(\bm{w}_+, \bm{w}_-) = l(\langle X, \bm{w}_+^2 - \bm{w}_-^2 \rangle - y), \quad \text{with } l(a) = a^2/4.$

  • Gradient descent step:

    $\bm{w}_{t+1} = \bm{w}_t - \eta \nabla_\bm{w} \mathcal{L}(\bm{w}_t).$

  • EoS regime identification:

    $S_t := \lambda_{\max}(\nabla^2 \mathcal{L}(\bm{w}_t)) > \frac{2}{\eta} \quad \text{for some } t.$

  • Non-asymptotic convergence guarantees in the different regimes of $\mu\eta$ mirror the stability threshold, with oscillations appearing beyond the classical threshold.
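
For concreteness, in the one-sample case a direct computation from the loss above (a standard calculation, not a formula quoted from the paper) gives coordinate-wise gradients and a multiplicative form of the GD update. With $r_t := \langle X, \bm{w}_{+,t}^2 - \bm{w}_{-,t}^2 \rangle - y$ and $l'(a) = a/2$,

    $\nabla_{\bm{w}_+} \mathcal{L}(\bm{w}_t) = r_t \, (X \odot \bm{w}_{+,t}), \qquad \nabla_{\bm{w}_-} \mathcal{L}(\bm{w}_t) = - r_t \, (X \odot \bm{w}_{-,t}),$

so that

    $\bm{w}_{+,t+1} = \bm{w}_{+,t} \odot (\mathbf{1} - \eta\, r_t X), \qquad \bm{w}_{-,t+1} = \bm{w}_{-,t} \odot (\mathbf{1} + \eta\, r_t X).$

The per-coordinate factors $(1 \mp \eta\, r_t X_i)$ make explicit that the effective curvature seen by GD depends on the current parameter magnitudes, which is why the sharpness, and with it the EoS condition, evolves along the trajectory.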

6. Significance for Optimization and Model Design

These findings have several important consequences:

  • Optimization: Large learning rates can be used safely in this setting, enabling faster convergence, provided the overparameterization and non-degeneracy conditions hold, even when the loss is quadratic.
  • Parameterization: The structure of the parameterization is essential; quadratic lifting of the parameters can give rise to EoS behavior that is otherwise absent.
  • Implicit Regularization: EoS can modify the implicit bias of GD, leading to solutions not attainable at small step sizes; thus, the choice of optimizer hyperparameters directly affects the solution manifold, generalization, and potential sparsity.
  • Generalization: Since EoS is robustly observed in overparameterized, high-dimensional settings, it may help explain both the success and the sometimes unpredictable biases of modern learning algorithms.

Summary Table: EoS Regimes in Parameterized Linear Regression

| Regime | Step size / condition | GD dynamics | Implicit bias |
|---|---|---|---|
| Classical ($\eta < 2/L$) | "Small" step size | Monotonic, stable, convergent | Approximately minimum-$\ell_1$-norm (sparse) |
| EoS ($\eta > 2/L$; DLN, $d > n$) | "Large" step size, overparameterized, non-degenerate data | Oscillatory, convergent | May differ; error bounded away from zero, less sparse |
| EoS absent ($d \leq n$ or degenerate data) | Not applicable | Divergence, instability | No interpolator reached |

The Edge of Stability in parameterized quadratic regression makes explicit that the confluence of parameterization, overparameterization, and learning rate determines not just convergence behavior but also the implicit regularization and the nature of model selection in modern learning systems.