Edge of Stability (EoS) in Optimization

Updated 30 June 2025
  • Edge of Stability (EoS) is a regime where gradient descent converges despite using step sizes beyond the classical 2/L threshold.
  • Empirical and theoretical analyses in overparameterized diagonal linear networks reveal oscillatory dynamics that still lead to zero risk.
  • EoS alters the implicit bias, yielding denser solutions than the sparsity-inducing, approximately minimum-$\ell_1$-norm bias observed with small learning rates.

The Edge of Stability (EoS) is a regime in gradient-based optimization where the dynamics of parameter updates defy classical stability guarantees yet still converge, often with notable implications for the optimization trajectory and the model's implicit bias. In the setting studied in the paper, linear regression with quadratic loss under a diagonal linear network (DLN) parameterization, EoS emerges as a precise and instructive phenomenon, illustrating how non-classical dynamics can arise even from a paradigmatic convex problem once it is suitably overparameterized.

1. Definition and Characterization of Edge of Stability

The Edge of Stability refers to the dynamical regime of gradient descent (GD) in which the step size $\eta$ exceeds the canonical stability threshold $2/L$, where $L$ is the smoothness constant of the loss (the Lipschitz constant of its gradient). In this regime, the maximum eigenvalue of the loss Hessian, or sharpness, $S_t = \lambda_{\max}(\nabla^2 \mathcal{L}(\bm{w}_t))$, can regularly surpass $2/\eta$ during training, violating the step-size condition of the Descent Lemma, the classical guarantee of monotonic loss decrease.

For an objective $\mathcal{L}$ that is $L$-smooth, classical theory guarantees convergence only for $\eta < 2/L$. EoS describes empirically observed cases where, for $\eta > 2/L$, GD still reaches a solution despite transient oscillations in the loss or sharpness, and without strictly monotonic descent.
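
In its standard form, the Descent Lemma states that for an $L$-smooth objective and step size $\eta$,

$\mathcal{L}(\bm{w}_{t+1}) \leq \mathcal{L}(\bm{w}_t) - \eta\left(1 - \tfrac{\eta L}{2}\right)\|\nabla \mathcal{L}(\bm{w}_t)\|^2,$

so a guaranteed decrease requires $\eta < 2/L$; beyond this threshold the lemma provides no control over the iterates.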

In this context, EoS is formally present if, for some iterate $t$,

$S_t > \frac{2}{\eta}.$

This occurs consistently in the experiments and theoretical analysis of the referenced work for overparameterized DLNs with quadratic loss.
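
As a concrete illustration, the following sketch (assuming the one-sample DLN setting; all values and variable names are illustrative, not code from the paper) computes the sharpness $S_t$ from the closed-form Hessian of the quadratic DLN loss and tests the EoS condition $S_t > 2/\eta$:

```python
import numpy as np

def dln_loss_hessian(w_plus, w_minus, X, y):
    """Hessian of L(w+, w-) = (<X, w+^2 - w-^2> - y)^2 / 4 with respect to (w+, w-)."""
    r = X @ (w_plus**2 - w_minus**2) - y                      # residual
    g = np.concatenate([2 * X * w_plus, -2 * X * w_minus])    # gradient of the residual
    # Since L = r^2 / 4:  Hessian = 0.5 * g g^T + r * diag([X, -X])
    return 0.5 * np.outer(g, g) + r * np.diag(np.concatenate([X, -X]))

def sharpness(w_plus, w_minus, X, y):
    """Largest Hessian eigenvalue S_t at the current iterate."""
    return np.linalg.eigvalsh(dln_loss_hessian(w_plus, w_minus, X, y)).max()

# Illustrative one-sample problem in d = 2 with a deliberately large step size.
rng = np.random.default_rng(0)
X, y, eta = np.array([1.0, -0.5]), 1.0, 1.5
w_plus, w_minus = 1.0 + 0.1 * rng.standard_normal(2), 0.1 * rng.standard_normal(2)

S_t = sharpness(w_plus, w_minus, X, y)
print(f"S_t = {S_t:.3f}, 2/eta = {2/eta:.3f}, EoS condition met: {S_t > 2/eta}")
```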

2. Main Results and Divergence from Previous Beliefs

Contrary to prior literature, which suggested that subquadratic loss functions (those with $l''(a) \to 0$ as $|a| \to \infty$) are necessary for EoS to appear, this paper establishes, both empirically and theoretically, that EoS manifests even for the canonical quadratic loss $l(a) = a^2/4$, given an appropriate parameterization.

Diagonal linear networks, parameterized such that $\bm{\beta} = \bm{w}_+^2 - \bm{w}_-^2$ (each component of the predictor is a difference of squares of parameter entries), imbue the optimization landscape with higher-order structure. Under this parameterization, combined with overparameterization ($d > n$ for $n$ samples in $d$ dimensions) and non-degenerate data, GD step sizes beyond the classical threshold do not necessarily cause divergence. Instead, the algorithm can still converge to zero risk, provided suitable conditions on the initialization and the data are met.

The critical finding is that EoS can emerge not because of special loss tail behavior, but due to the structure of the parameterization—even if the loss is a strictly convex quadratic.
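
One way to see the higher-order structure introduced by the parameterization is numerically: the same quadratic loss, expressed in the DLN parameters, grows quartically along rays in parameter space. The sketch below (a toy illustration with assumed values, not taken from the paper) makes this concrete.

```python
import numpy as np

# Toy illustration: the loss is quadratic in beta but quartic in the DLN
# parameters (w+, w-), since beta = w+^2 - w-^2 enters the squared residual.
X, y = np.array([1.0, 0.5]), 1.0

def loss_in_beta(beta):
    return 0.25 * (X @ beta - y) ** 2          # l(a) = a^2 / 4

def loss_in_w(w_plus, w_minus):
    return loss_in_beta(w_plus**2 - w_minus**2)

direction = np.ones(2)
for t in (2.0, 4.0, 8.0):
    print(f"t = {t}: loss in beta (~t^2): {loss_in_beta(t * direction):9.2f}   "
          f"loss in w (~t^4): {loss_in_w(t * direction, np.zeros(2)):10.2f}")
```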

3. Empirical and Theoretical Evidence for EoS

The paper presents a suite of empirical results, aligned with non-asymptotic theoretical analysis, for both the one-sample and multi-sample regression cases:

  • Empirical observations: For the quadratic loss with DLN parameterization, GD with $\eta > 2/L$ exhibits oscillatory yet convergent dynamics in the interpolating regime ($d \ge 2$ in the one-sample case), and does not converge when $d = 1$ or the target is degenerate, consistent with the theoretical predictions.
  • Theoretical analysis: Dynamical-systems and bifurcation-theory arguments, tailored to time-varying coefficients, show that in both the $\mu\eta < 1$ and $\mu\eta > 1$ regimes ($\mu$ is a data-dependent scale parameter), GD converges non-asymptotically to an interpolating solution despite transient instability or loss oscillation. For instance, for a one-sample, two-feature regression, one obtains the exponential bound

    $|\langle \bm{\beta}_t - \bm{\beta}_\infty, X \rangle| \leq C \cdot \exp\left[ - \Theta(\mu\eta) (t - t_0) \right] |\langle \bm{\beta}_{t_0} - \bm{\beta}_\infty, X \rangle|,$

    for $\mu\eta < 1$ (with an analogous bound for $\mu\eta > 1$), capturing the transition across EoS.

In this setting, EoS appears if and only if the model is overparameterized and the data are non-degenerate; with insufficient overparameterization, the large-step-size dynamics diverge rather than enter EoS.
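
The following minimal harness (an illustrative sketch, not the paper's code; the data, initialization scale, and step sizes are assumed values) runs GD on the one-sample DLN objective so that the loss trajectory can be inspected for the monotone, oscillatory-but-convergent, or divergent behavior described above:

```python
import numpy as np

def run_gd(X, y, eta, alpha=0.1, steps=2000, seed=0):
    """Run GD on L(w+, w-) = (<X, w+^2 - w-^2> - y)^2 / 4 and record the loss."""
    rng = np.random.default_rng(seed)
    w_plus = alpha * np.abs(rng.standard_normal(X.shape))    # small positive init
    w_minus = alpha * np.abs(rng.standard_normal(X.shape))
    losses = []
    for _ in range(steps):
        beta = w_plus**2 - w_minus**2
        r = X @ beta - y                       # residual
        loss = 0.25 * r**2
        losses.append(loss)
        if not np.isfinite(loss) or loss > 1e12:
            break                              # stop early if the run blows up
        grad_plus = r * X * w_plus             # dL/dw+
        grad_minus = -r * X * w_minus          # dL/dw-
        w_plus = w_plus - eta * grad_plus
        w_minus = w_minus - eta * grad_minus
    return np.array(losses), w_plus**2 - w_minus**2

# One sample in d = 2 (interpolating regime); compare a small and a large step size.
X, y = np.array([1.0, 0.5]), 1.0
for eta in (0.1, 2.0):
    losses, beta = run_gd(X, y, eta)
    print(f"eta = {eta}: final loss = {losses[-1]:.3e}, "
          f"monotone decrease = {bool(np.all(np.diff(losses) <= 0))}, beta = {beta}")
```

The printed summaries can then be compared against the qualitative distinction drawn above.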

4. Implicit Bias of Diagonal Linear Networks at EoS

The implicit bias of a solution refers to which interpolating minimum GD selects among the many available in an overparameterized setting. For small learning rates, or under continuous-time gradient flow, DLNs recover (approximately) the minimum-$\ell_1$-norm interpolator, favoring sparsity.

The paper shows that, at EoS ($\mu\eta > 1$), this bias can be altered:

  • In the regime $\mu\eta < 1$, the standard bias is retained, and the solution error vanishes as the initialization scale is reduced.
  • In the EoS regime, the residual error and the deviation from the classical bias remain bounded away from zero as the initialization shrinks:

    $\|\bm{\beta}_\infty - \bm{\beta}^*\| \geq \mathcal{C}\,(\mu\eta - 1),$

    with $\mathcal{C}$ a constant depending on the step size and the data. Thus, large step sizes can steer the solution away from the classical bias, leading to denser solutions and potentially reduced sparsity or interpretability.
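
As a small illustration of the reference point in this bound, the sketch below computes the minimum-$\ell_1$-norm interpolator for a one-sample toy problem; taking this as the sparse target $\bm{\beta}^*$ is an assumption made here for illustration. The $\bm{\beta}$ returned by a GD run (for example, the harness sketched in Section 3) can then be compared against it.

```python
import numpy as np

def min_l1_interpolator(X, y):
    """Minimum-l1-norm solution of the single constraint <X, beta> = y.

    For one linear equation, all l1 mass goes on the coordinate of X with the
    largest absolute value (a standard fact, used here only for illustration).
    """
    j = np.argmax(np.abs(X))
    beta_star = np.zeros_like(X)
    beta_star[j] = y / X[j]
    return beta_star

# Illustrative data (assumed, not from the paper).
X, y = np.array([1.0, 0.5]), 1.0
beta_star = min_l1_interpolator(X, y)
print("beta* =", beta_star, " ||beta*||_1 =", np.abs(beta_star).sum())

# Given a beta_inf from a GD run, inspect ||beta_inf - beta_star|| and compare
# l1 norms or support sizes to gauge how dense the recovered solution is.
```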

5. Mathematical Formulations and Regimes

Key quantities and dynamics include:

  • The quadratic parameterized regression loss,

    $\mathcal{L}(\bm{w}_+, \bm{w}_-) = l(\langle X, \bm{w}_+^2 - \bm{w}_-^2 \rangle - y), \quad \text{with } l(a) = a^2/4.$

  • Gradient descent step:

    $\bm{w}_{t+1} = \bm{w}_t - \eta \nabla_\bm{w} \mathcal{L}(\bm{w}_t).$

  • EoS regime identification:

    $S_t := \lambda_{\max}(\nabla^2 \mathcal{L}(\bm{w}_t)) > \frac{2}{\eta} \quad \text{for some } t.$

  • Non-asymptotic convergence guarantees in the different regimes of $\mu\eta$ mirror the stability threshold, with oscillations appearing beyond the classical threshold.
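
For concreteness, in the one-sample case a direct computation from the loss above (a standard calculation, not a formula quoted from the paper) gives coordinate-wise gradients and a multiplicative form of the GD update. With $r_t := \langle X, \bm{w}_{+,t}^2 - \bm{w}_{-,t}^2 \rangle - y$ and $l'(a) = a/2$,

    $\nabla_{\bm{w}_+} \mathcal{L}(\bm{w}_t) = r_t \, (X \odot \bm{w}_{+,t}), \qquad \nabla_{\bm{w}_-} \mathcal{L}(\bm{w}_t) = - r_t \, (X \odot \bm{w}_{-,t}),$

so that

    $\bm{w}_{+,t+1} = \bm{w}_{+,t} \odot (\mathbf{1} - \eta\, r_t X), \qquad \bm{w}_{-,t+1} = \bm{w}_{-,t} \odot (\mathbf{1} + \eta\, r_t X).$

The per-coordinate factors $(1 \mp \eta\, r_t X_i)$ make explicit that the effective curvature seen by GD depends on the current parameter magnitudes, which is why the sharpness, and with it the EoS condition, evolves along the trajectory.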

6. Significance for Optimization and Model Design

These findings have several important consequences:

  • Optimization: Large learning rates can be used safely in this setting, enabling faster convergence, provided the overparameterization and non-degeneracy conditions hold, even when the loss is quadratic.
  • Parameterization: The structure of the parameterization is essential; quadratic lifting of the parameters can give rise to EoS behavior that is otherwise absent.
  • Implicit Regularization: EoS can modify the implicit bias of GD, leading to solutions not attainable at small step sizes; thus, the choice of optimizer hyperparameters directly affects the solution manifold, generalization, and potential sparsity.
  • Generalization: Since EoS is robustly observed in overparameterized, high-dimensional settings, it may help explain both the success and the sometimes unpredictable biases of modern learning algorithms.

Summary Table: EoS Regimes in Parameterized Linear Regression

| Regime | Step size / condition | GD dynamics | Implicit bias |
|---|---|---|---|
| Classical ($\eta < 2/L$) | "Small" step size | Monotonic, stable, convergent | Approximately minimum-$\ell_1$-norm (sparse) |
| EoS ($\eta > 2/L$; DLN, $d > n$) | "Large" step size, overparameterized, non-degenerate data | Oscillatory, convergent | May differ; error bounded away from zero, less sparse |
| EoS absent ($d \leq n$ or degenerate data) | Not applicable | Divergence, instability | No interpolator reached |

The Edge of Stability in parameterized quadratic regression makes explicit that the confluence of parameterization, overparameterization, and learning rate determines not just convergence behavior but also the implicit regularization and the nature of model selection in modern learning systems.