Edge of Stability in Optimization

Updated 4 July 2026

Edge of stability is a training regime in which sharpness reaches a critical curvature threshold (≈2/η for gradient descent) and the loss oscillates while still decreasing over longer horizons.
This article examines how the threshold changes for optimizers such as GD, SAM, and momentum methods, and how training settles near it.
Empirical and theoretical studies characterize the stability boundaries, optimizer-specific thresholds, and effects on representation dynamics.

The edge of stability phenomenon is a regime of iterative optimization in which training approaches the discrete-time stability boundary implied by a local quadratic model of the loss. In its canonical form for full-batch gradient descent (GD) with step size $\eta$, the sharpness (typically the largest eigenvalue of the training-loss Hessian) rises during training until it is approximately $2/\eta$, after which it hovers near that value; the loss becomes non-monotone on short timescales, yet continues to decrease over longer horizons. Subsequent work has generalized this picture to optimizer-dependent boundaries, including Sharpness-Aware Minimization (SAM), non-Euclidean descent, stochastic and momentum methods, and related NTK and bifurcation analyses, while also showing that the phenomenon is sensitive to loss geometry, data regime, and alignment structure (Cohen et al., 2021, Long et al., 2023, Islamov et al., 5 Mar 2026).

1. Operational definition and canonical thresholds

In the standard formulation, sharpness is the maximum eigenvalue of the training-loss Hessian. For vanilla GD on a quadratic $f(\mathbf{x})=\tfrac12 \mathbf{x}^\top A\mathbf{x}+\mathbf{b}^\top\mathbf{x}+c$, the update along an eigen-direction with eigenvalue $a$ is unstable if $a>2/\eta$, since

$x_{t+1}-x^*=(1-\eta a)(x_t-x^*),$

and $(1-\eta a)<-1$ when $a>2/\eta$, so the error flips sign and grows in magnitude at every step. This motivates the operational definition of edge of stability: a regime where

$\lambda_{\max}(\nabla^2 f(\theta_t)) \approx \frac{2}{\eta},$

the loss is jagged or locally unstable, and yet long-horizon optimization continues (Cohen et al., 2021).
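The scalar recursion above can be checked numerically. The following is a minimal sketch, assuming a one-dimensional quadratic $f(x)=\tfrac12 a x^2$ with minimizer $x^*=0$; the function name `gd_on_quadratic` and the chosen values of `a` and `eta` are illustrative and not taken from the cited papers.

```python
import numpy as np

def gd_on_quadratic(a, eta, x0=1.0, steps=50):
    """Run GD on f(x) = 0.5 * a * x**2 and return |x_t| at every step."""
    x = x0
    distances = [abs(x)]
    for _ in range(steps):
        x = x - eta * a * x  # equivalent to x <- (1 - eta * a) * x
        distances.append(abs(x))
    return np.array(distances)

eta = 0.1  # stability threshold is 2 / eta = 20
below = gd_on_quadratic(a=19.0, eta=eta)  # multiplier 1 - 1.9 = -0.9
above = gd_on_quadratic(a=21.0, eta=eta)  # multiplier 1 - 2.1 = -1.1

print("curvature 19 (below 2/eta): final |x| =", below[-1])
print("curvature 21 (above 2/eta): final |x| =", above[-1])
```

On a pure quadratic the iterate above the threshold simply diverges. The edge-of-stability regime arises only because real losses are not quadratic, so the curvature itself changes along the trajectory.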

A closely related formulation appears in work that treats the phenomenon as the interaction between curvature and the stability limit of the update rule under a quadratic-loss approximation. In that language, training typically has a progressive sharpening phase, where the leading Hessian eigenvalue rises while the loss decreases monotonically, followed by an edge-of-stability phase in which the leading eigenvalue oscillates near the optimizer’s divergence threshold and the loss becomes locally unstable while still decreasing over longer time frames (Iordan et al., 2023).
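Tracking the leading Hessian eigenvalue through these two phases does not require forming the Hessian. A common approach is power iteration on Hessian-vector products. The sketch below assumes only access to a gradient function and approximates the Hessian-vector product by central differences; the names `hvp` and `sharpness` are hypothetical, and in practice an automatic-differentiation library would usually supply the product.

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-4):
    """Central-difference approximation of the Hessian-vector product H(theta) @ v."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def sharpness(grad_fn, theta, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        lam = float(v @ hv)          # Rayleigh quotient for the current direction
        norm = np.linalg.norm(hv)
        if norm == 0.0:              # direction lies in the null space; stop
            break
        v = hv / norm
    return lam

# Illustrative check on a quadratic whose Hessian is known exactly.
A = np.diag([1.0, 3.0, 10.0])
estimate = sharpness(lambda th: A @ th, np.zeros(3))
print("estimated sharpness:", estimate)
```

Power iteration converges to the eigenvalue of largest magnitude, which coincides with $\lambda_{\max}$ only when no negative eigenvalue is larger in absolute value.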

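Setting	Curvature statistic	Threshold
GD	$\lambda_{\max}(\nabla^2 L)$	$2/\eta$
Polyak momentum (coefficient $\beta$)	$\lambda_{\max}(\nabla^2 L)$	$(2+2\beta)/\eta$
Nesterov momentum (coefficient $\beta$)	$\lambda_{\max}(\nabla^2 L)$	$(2+2\beta)/(\eta(1+2\beta))$
SAM (radius $\rho$, gradient $g$)	$\|\nabla^2 L\|$	$\frac{\|g\|}{2\rho}\left(\sqrt{1+\frac{8\rho}{\eta\|g\|}}-1\right)$
Non-Euclidean GD	directional smoothness or generalized sharpness	$2/\eta$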

The table emphasizes that the phrase does not denote a single invariant scalar across all optimizers. Rather, the stability boundary depends on the geometry of the update rule: Euclidean GD yields the classical $2/\eta$, momentum changes the maximum stable sharpness, SAM introduces a gradient-dependent edge, and non-Euclidean descent replaces Euclidean sharpness with directional smoothness or generalized sharpness (Cohen et al., 2021, Long et al., 2023, Islamov et al., 5 Mar 2026).
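How momentum moves the boundary can be seen on a one-dimensional quadratic with curvature $a$, where each method reduces to a linear two-step recursion whose spectral radius decides stability. The sketch below assumes the heavy-ball form $x_{t+1}=x_t-\eta a x_t+\beta(x_t-x_{t-1})$ and the Nesterov form $y_t=x_t+\beta(x_t-x_{t-1})$, $x_{t+1}=y_t-\eta a y_t$. Under these conventions the classical quadratic analysis gives edges $(2+2\beta)/\eta$ and $(2+2\beta)/(\eta(1+2\beta))$; other momentum parameterizations, such as EMA-style momentum, rescale these values. Function names are illustrative.

```python
import numpy as np

def spectral_radius_polyak(a, eta, beta):
    """Spectral radius of heavy-ball GD on f(x) = 0.5 * a * x**2."""
    # State (x_t, x_{t-1}); x_{t+1} = (1 + beta - eta*a) x_t - beta x_{t-1}
    M = np.array([[1.0 + beta - eta * a, -beta],
                  [1.0, 0.0]])
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def spectral_radius_nesterov(a, eta, beta):
    """Spectral radius of Nesterov momentum on f(x) = 0.5 * a * x**2."""
    # x_{t+1} = (1 - eta*a) * ((1 + beta) x_t - beta x_{t-1})
    c = 1.0 - eta * a
    M = np.array([[c * (1.0 + beta), -c * beta],
                  [1.0, 0.0]])
    return float(np.max(np.abs(np.linalg.eigvals(M))))

eta, beta = 0.1, 0.9
polyak_edge = (2 + 2 * beta) / eta                      # 38 for these values
nesterov_edge = (2 + 2 * beta) / (eta * (1 + 2 * beta))

for name, radius, edge in [
    ("Polyak", spectral_radius_polyak, polyak_edge),
    ("Nesterov", spectral_radius_nesterov, nesterov_edge),
]:
    print(name, "edge:", edge,
          "| radius just below:", radius(0.99 * edge, eta, beta),
          "| radius just above:", radius(1.01 * edge, eta, beta))
```

With $\beta=0$ both recursions reduce to plain GD and the edge returns to $2/\eta$.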

2. Quadratic local theory and optimizer-dependent edges

For GD, an exact local quadratic model around the current iterate $\theta$,

$f(\theta+\delta)=f(\theta)+g^\top\delta+\tfrac12\,\delta^\top H\delta,$

with $g=\nabla f(\theta)$ and $H=\nabla^2 f(\theta)$, gives

$f(\theta-\eta g)-f(\theta)=-\eta\|g\|^2+\tfrac{\eta^2}{2}\,g^\top H g.$

Hence $\lambda_{\max}(H)<2/\eta$ implies one-step decrease whenever $g\neq 0$. Moreover, if $g$ is aligned with a principal eigenvector of $H$ with nonnegative eigenvalue $\lambda$, then the sign of $f(\theta-\eta g)-f(\theta)$ is exactly the sign of $\lambda-2/\eta$. This is why $2/\eta$ is interpreted as a local stability boundary rather than merely a heuristic scale (Long et al., 2023).
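The one-step argument can be verified directly on a synthetic quadratic. The sketch below assumes the second-order expansion $f(\theta-\eta g)-f(\theta)=-\eta\|g\|^2+\tfrac{\eta^2}{2}g^\top Hg$ with $g=\nabla f(\theta)$ and $H=\nabla^2 f(\theta)$, and uses an illustrative Hessian with eigenvalues 1 to 5; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthonormal basis
eigs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # illustrative spectrum
H = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(5)

def f(theta):
    return 0.5 * theta @ H @ theta + b @ theta

def grad(theta):
    return H @ theta + b

def one_step_change(theta, eta):
    """Return (actual, predicted) loss change for one GD step from theta."""
    g = grad(theta)
    actual = f(theta - eta * g) - f(theta)
    predicted = -eta * (g @ g) + 0.5 * eta**2 * (g @ H @ g)
    return actual, predicted

# 1) The expansion is exact for a quadratic, at any point and step size.
actual, predicted = one_step_change(rng.standard_normal(5), eta=0.3)

# 2) Sign test: pick theta so that the gradient equals the top eigenvector.
lam = eigs[-1]
v_top = Q[:, -1]
theta_aligned = np.linalg.solve(H, v_top - b)  # then grad(theta_aligned) = v_top
change_below, _ = one_step_change(theta_aligned, eta=0.95 * 2 / lam)  # lam < 2/eta
change_above, _ = one_step_change(theta_aligned, eta=1.05 * 2 / lam)  # lam > 2/eta

print("actual vs predicted:", actual, predicted)
print("change with lam below 2/eta:", change_below)
print("change with lam above 2/eta:", change_above)
```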

For SAM, the update is

$\theta_{t+1}=\theta_t-\eta\,\nabla f\!\left(\theta_t+\rho\,\frac{\nabla f(\theta_t)}{\|\nabla f(\theta_t)\|}\right),$

and the same local quadratic analysis yields a different stability boundary. Under the paper's assumptions, and writing $g=\nabla f(\theta_t)$, the analysis gives the SAM edge of stability

$\frac{\|g\|}{2\rho}\left(\sqrt{1+\frac{8\rho}{\eta\|g\|}}-1\right).$

The main theoretical difference from GD is that this boundary depends on the gradient norm $\|g\|$ as well as $\eta$ and $\rho$, and it becomes smaller as training progresses and $\|g\|$ decreases. Empirically, on a depth-4 fully connected network on MNIST with quadratic loss, a CNN on CIFAR10 trained on the first 1000 examples with quadratic loss, and a Transformer LLM on tiny_shakespeare with minibatch training, the Hessian norm under SAM tracks this predicted boundary rather than $2/\eta$ (Long et al., 2023).
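The gradient dependence of the SAM boundary can be illustrated with a small calculation. The sketch below assumes the normalized SAM step $\theta-\eta\nabla f(\theta+\rho g/\|g\|)$ on a quadratic whose gradient $g$ is an eigenvector of the Hessian with eigenvalue $\lambda$. Under that assumption the perturbed gradient is $(1+\rho\lambda/\|g\|)\,g$, and solving for the zero of the one-step loss change gives the edge $\frac{\|g\|}{2\rho}\big(\sqrt{1+8\rho/(\eta\|g\|)}-1\big)$. Function names and numeric values are illustrative.

```python
import numpy as np

def sam_edge(grad_norm, eta, rho):
    """SAM stability boundary under the aligned-gradient quadratic model."""
    return grad_norm / (2.0 * rho) * (np.sqrt(1.0 + 8.0 * rho / (eta * grad_norm)) - 1.0)

def sam_step_change(lam, grad_norm, eta, rho):
    """Loss change of one SAM step when the gradient is an eigenvector with eigenvalue lam."""
    scale = 1.0 + rho * lam / grad_norm  # perturbed gradient = scale * g
    step = eta * scale                   # effective step size along g
    return -step * grad_norm**2 + 0.5 * step**2 * lam * grad_norm**2

eta, rho = 0.1, 0.05
edge_large_grad = sam_edge(1.0, eta, rho)   # early training: larger gradient norm
edge_small_grad = sam_edge(0.01, eta, rho)  # late training: smaller gradient norm

print("GD edge 2/eta:", 2.0 / eta)
print("SAM edge at ||g|| = 1.0 :", edge_large_grad)
print("SAM edge at ||g|| = 0.01:", edge_small_grad)
print("change just below edge:", sam_step_change(0.99 * edge_large_grad, 1.0, eta, rho))
print("change just above edge:", sam_step_change(1.01 * edge_large_grad, 1.0, eta, rho))
```

In this model the SAM edge always lies below $2/\eta$ and shrinks toward zero with the gradient norm, which matches the qualitative behavior described above.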

A broader generalization replaces Euclidean smoothness by directional smoothness. For non-Euclidean GD, in which the step is taken with respect to a general norm, the analysis rests on an exact one-step identity that expresses the change in loss through the directional smoothness along the step. From that identity, the loss decreases if and only if the directional smoothness is below $2/\eta$.

Under a second-order approximation, this yields a generalized sharpness which recovers vanilla GD, preconditioned GD, steepest descent in a non-Euclidean vector norm, Block CD, Spectral GD, and Muon without momentum as special cases. In experiments, the relevant generalized sharpness rises during progressive sharpening and then hovers at or slightly above $2/\eta$ for these non-Euclidean methods (Islamov et al., 5 Mar 2026).

3. Dynamical mechanisms: self-stabilization, bifurcation, and oscillatory structure

The empirical signatures of edge-of-stability training (alternating iterates, loss spikes, and sharpness stabilization) have motivated several mechanistic models. One line of work shows that, after a canonical reparameterization, different GD trajectories align on a bifurcation diagram independent of initialization. In a two-layer fully connected linear network and a single-neuron nonlinear network trained with a single data point, the reduced dynamics take the form of a low-dimensional map with a period-doubling bifurcation at the point corresponding to the normalized sharpness threshold. In this view, progressive sharpening is a slow drift toward the bifurcation point, and the EoS phase is the regime where the fixed point loses stability and a stable period-2 orbit appears (Song et al., 2023).
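A fixed point giving way to a stable period-2 orbit can be reproduced in one dimension. The sketch below is a toy illustration, not the reduced map from the cited work: it assumes the scalar loss $f(x)=\log\cosh x$, whose gradient is $\tanh x$ and whose curvature $1/\cosh^2 x$ equals 1 at the minimum and falls off away from it. The function names and step sizes are hypothetical.

```python
import numpy as np

def gd_logcosh(eta, x0=0.1, steps=300):
    """Run GD on f(x) = log(cosh(x)) and return the iterates."""
    x = x0
    xs = [x]
    for _ in range(steps):
        x = x - eta * np.tanh(x)  # f'(x) = tanh(x)
        xs.append(x)
    return np.array(xs)

def curvature(x):
    """Second derivative of log(cosh(x))."""
    return 1.0 / np.cosh(x) ** 2

stable = gd_logcosh(eta=1.5)  # eta * f''(0) = 1.5 < 2: the minimum is stable
cycle = gd_logcosh(eta=2.5)   # eta * f''(0) = 2.5 > 2: the minimum is unstable

print("eta=1.5 final iterate:", stable[-1])
print("eta=2.5 last two iterates:", cycle[-2], cycle[-1])
print("curvature on the orbit:", curvature(cycle[-1]), "vs 2/eta =", 2 / 2.5)
```

With the larger step size the iterates do not diverge. They settle on a symmetric pair of points at which the curvature is below $2/\eta$. Unlike genuine edge-of-stability training, the loss in this toy stops decreasing once the orbit is reached.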

A more explicit continuous-time model is Edge Flow, which decomposes the GD iterate near EoS into a center, an oscillation direction, and an oscillation magnitude, and evolves these three quantities through coupled ODEs.

The model attributes stabilization to a feedback loop: if sharpness rises above $2/\eta$, the oscillation magnitude grows; the symmetrized center dynamics then introduce a third-derivative correction that pushes the center toward lower-curvature regions; if sharpness falls below $2/\eta$, the oscillation magnitude decays and ordinary progressive sharpening resumes (Marion, 16 Jun 2026).

A discrete variational explanation appears in work introducing an edge coupling whose defining condition is exactly the GD update $\theta_{t+1}=\theta_t-\eta\nabla f(\theta_t)$. Differencing this condition across consecutive steps yields

$\Delta_{t+1}=(I-\eta\bar H_t)\,\Delta_t,$

with $\Delta_t=\theta_{t+1}-\theta_t$ and $\bar H_t=\int_0^1\nabla^2 f(\theta_t+s\Delta_t)\,ds$ the segment-averaged Hessian, making the boundary $2/\eta$ explicit. A second-order expansion gives a telescoping loss identity that forces a weighted average curvature toward $2/\eta$, and a mean value theorem argument localizes the averaged curvature to the true Hessian at interior points of each step segment (Litman, 22 Apr 2026).

Minimalist examples isolate the same structure in tractable nonconvex systems. A degree-4 scalar product objective yields a two-step map with a stabilizing cubic term and a parabolic slow manifold; under explicit local conditions, GD converges to a minimum whose sharpness lies in an explicit interval, capturing both convergence above the naive threshold and endpoint sharpness slightly below $2/\eta$ (Zhu et al., 2022). In deep linear networks, the regime beyond EoS follows a period-doubling route to chaos; oscillations occur in a small subspace whose dimension is determined by the learning rate, and the symmetry-induced balancing gap from gradient flow breaks at EoS and decays monotonically to zero (Ghosh et al., 27 Feb 2025).

4. Stochastic, momentum, and batch-sensitive formulations

Mini-batch training modifies the deterministic picture in several distinct ways. One approach extends full-batch self-stabilization to stochastic self-stabilization and derives a closed-form equilibrium sharpness gap expressed through three quantities: the progressive sharpening rate, the self-stabilization strength, and the gradient-noise variance projected onto the top Hessian eigenvector. This predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset (Liao et al., 22 Apr 2026).

A complementary high-dimensional analysis identifies conservative sharpening and a distinct stochastic edge of stability. In that theory, minibatch noise suppresses the later growth of already-large curvature modes, and the relevant instability criterion is often not the top Hessian eigenvalue but a separate noise-dependent quantity, for which the theory also gives a small-curvature approximation. This yields a stochastic edge sensitive to the NTK spectrum and, at small batch size, often to its trace rather than to its largest eigenvalue alone (Agarwala et al., 2024).

Momentum introduces an explicitly batch-size-dependent edge. For SGDM, Batch Sharpness stabilizes at one plateau in the small-batch regime and at a different plateau in the large-batch regime; for SGDN, the large-batch plateau differs again. The paper’s central point is that momentum does not merely rescale the step size by a single deterministic factor; rather, the operative edge depends on whether the dynamics are noise-dominated or near-deterministic (Andreyev et al., 15 Apr 2026).

Another stochastic formulation, developed for multiclass cross-entropy in linear classifiers and two-layer neural networks, replaces pointwise monotonicity by stochastic Lyapunov stability. There the stable set is defined by a loss threshold, SGD alternates between edge-of-stability excursions and stable periods, and self-stabilization guarantees return to stability in a fixed number of iterations with high probability (Emmanouilidis et al., 29 Jun 2026).

5. Manifestations across tasks, losses, and representation dynamics

The phenomenon is not uniform across domains. In off-policy deep reinforcement learning, it appears clearly for DQN with a Huber loss, especially in offline settings, where the leading Hessian eigenvalue rises to the quadratic threshold and then fluctuates around it. By contrast, C51 with a cross-entropy loss does not show a consistent edge-of-stability effect: in offline learning the leading Hessian eigenvalue often stays below the threshold or only briefly approaches it, while in online learning it can become much larger than the threshold and then decrease later. The paper therefore attributes a major role to loss geometry, not merely network architecture (Iordan et al., 2023).

The phenomenon also has a representation-level formulation. In NTK analyses, edge-of-stability training is associated not only with oscillation of the largest eigenvalue near a threshold of order $2/\eta$, but also with eigenvector rotation. In a two-layer linear theory, a ratio of model quantities governs an alignment shift: larger learning rates cause the target to align more strongly with the leading eigenvectors of the final NTK, and the sharpness-decreasing subphases of EoS coincide with sudden gains in alignment (Jiang et al., 17 Jul 2025).

EoS can also act selectively across the data distribution. One study uses a branching intervention—continuing at the same learning rate versus halving the learning rate at the onset of EoS—to show that staying at EoS improves some groups while suppressing others. Two necessary conditions are identified for a group to benefit: its aggregate gradient must align with the top Hessian eigenvector, and its gradient magnitude must remain non-vanishing over time. Under cross-entropy loss, gradient saturation can decouple confidently classified groups, shifting the EoS advantage to output-outliers whose gradients persist (Kwag et al., 2 Jun 2026).

A distinct feature-learning consequence appears in simplified two-layer ReLU models. There, a sharp learning-rate phase transition separates a regime in which the bias remains near zero from one in which GD enters the EoS regime and learns a genuinely negative first-layer bias, yielding a threshold neuron. In that model, large learning rates and unstable convergence are not incidental; they are the mechanism by which threshold-like units emerge (Ahn et al., 2022).

6. Scope, limitations, and adjacent meanings

Several caveats recur across the literature. The basic derivations are typically local and quadratic; stronger sign tests often assume favorable alignment between the gradient and a principal Hessian eigenvector, and in the SAM case the strongest proposition relies on an additional assumption. Minibatch results are noisier, and stochastic formulations do not always preserve the clean deterministic picture of sharpness hovering exactly at $2/\eta$. These facts limit any claim that the phenomenon is universal in a strict mechanistic sense (Long et al., 2023, Iordan et al., 2023).

A common misconception is that crossing the quadratic threshold must imply immediate divergence. The empirical and theoretical literature instead describes a regime of progress with oscillations: short-horizon monotonicity fails, local quadratic descent guarantees break, and yet optimization can remain effective over long horizons (Cohen et al., 2021, Marion, 16 Jun 2026). Another misconception is that edge of stability is exclusively a Euclidean full-batch GD effect; later work shows optimizer-specific edges for SAM, non-Euclidean methods, SGD, and momentum, with different operative curvature statistics (Long et al., 2023, Islamov et al., 5 Mar 2026, Andreyev et al., 15 Apr 2026).

The term also appears in adjacent but non-identical settings. In Edge of Stability Echo State Networks, the phrase refers to reservoir dynamics organized near an edge-of-chaos regime by construction: the reservoir Jacobian spectrum is confined to an annular neighborhood of a circle whose radius is set by the model's hyperparameters, the model has the echo state property (ESP) under an explicit condition on those hyperparameters, and in a small-parameter limit the maximum local Lyapunov exponent is characterized explicitly (Ceni et al., 2023). In kernel associative memory, the Ridge of Optimization is identified with an information-geometric edge where the Fisher Information Matrix becomes highly concentrated and nearly singular, and natural-gradient geometry produces a self-braking effect along the dominant curvature direction (Tamamori, 28 Nov 2025). This suggests that the label now spans several related boundary phenomena centered on criticality, curvature concentration, and marginal stability.

Across these formulations, the recurring theme is that practical training often does not remain in the classically “safe” small-step regime. Instead, it organizes near a boundary where the update rule, curvature geometry, loss structure, and sometimes data distribution jointly determine whether instability becomes divergence, oscillatory progress, selective learning, or a new implicit bias over representations and solutions.