Edge-of-Stability Regime in Neural Networks

Updated 27 April 2026

Edge-of-Stability is a regime in overparameterized neural networks where training oscillates near a sharpness threshold, defined by a critical relation between step size and the largest Hessian eigenvalue.
The regime unfolds in phases—from progressive sharpening towards a stability limit, to transient overshoots, and finally to self-stabilizing oscillations that enable efficient descent despite local instability.
Its dynamics critically influence learning rate tuning, batch size, momentum selection, and implicit regularization, thereby impacting model generalization and optimization effectiveness.

The Edge-of-Stability (EoS) regime is a widely observed phase in the training of overparameterized neural networks and related models. It is characterized by systematic oscillation of the sharpness (typically the largest Hessian eigenvalue or a related curvature proxy) near an instability threshold set by the optimization step size. Classical smooth optimization theory requires gradient descent step sizes to stay below a stability cutoff. Empirical and theoretical analyses show instead that training frequently crosses this boundary and then self-regulates at it (or, in specific cases, slightly below it), which permits efficient descent despite the apparent local instability.

1. Definition and Unifying Theoretical Principles

The EoS regime is precisely delineated by the coupling between the optimizer step size $\eta$ and curvature/“sharpness” metrics:

Sharpness: Typically $\lambda_{\max}(\nabla^2L(w))$ , the largest eigenvalue of the Hessian of the loss $L$ .
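Stability threshold: For classical gradient descent (GD) on a quadratic model, monotonic convergence requires $\eta \cdot \lambda_{\max} < 2$ . On an exact quadratic, violating this bound makes the iterates diverge along the top eigendirection; in neural network training it is instead observed to produce oscillatory, but not divergent, descent (Arora et al., 2022, Zhu et al., 2022).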
Edge of Stability: The regime where $\lambda_{\max} \gtrsim 2/\eta$ , yet training loss continues to decrease on average, interleaved with non-monotonic local oscillations (Litman, 22 Apr 2026).
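
As a minimal sketch of the stability threshold above, the following pure-Python snippet runs gradient descent on a one-dimensional quadratic $L(w) = \tfrac{1}{2}\lambda w^2$, whose sharpness is exactly $\lambda$. The function name `gd_on_quadratic` and the chosen values of $\eta$ and $\lambda$ are illustrative assumptions, not taken from the cited works.

```python
def gd_on_quadratic(eta, lam, w0=1.0, steps=20):
    """Run GD on L(w) = 0.5 * lam * w**2 and return the loss after every step."""
    w = w0
    losses = [0.5 * lam * w * w]
    for _ in range(steps):
        grad = lam * w          # dL/dw
        w = w - eta * grad      # equivalent to w <- (1 - eta * lam) * w
        losses.append(0.5 * lam * w * w)
    return losses


lam = 10.0                      # sharpness of the quadratic
threshold = 2.0 / lam           # critical step size, where eta * lam = 2

stable = gd_on_quadratic(eta=0.9 * threshold, lam=lam)    # eta * lam = 1.8
unstable = gd_on_quadratic(eta=1.1 * threshold, lam=lam)  # eta * lam = 2.2

# Below the threshold the loss shrinks every step; above it the loss grows.
print("eta*lam = 1.8:", stable[0], "->", stable[-1])
print("eta*lam = 2.2:", unstable[0], "->", unstable[-1])
```

On this exact quadratic the second run grows without bound. The EoS observation is that neural network training with $\lambda_{\max} \gtrsim 2/\eta$ does not behave this way, because the curvature itself changes along the trajectory.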

This phenomenon has been reported across optimizer families: full-batch GD, stochastic optimization, preconditioned/adaptive methods, and even zeroth-order algorithms all exhibit sharpness self-regulation at boundaries dictated by step size, optimizer geometry, and problem structure (Islamov et al., 5 Mar 2026, 2207.14484, Song et al., 16 Apr 2026).

2. Phases and Dynamics of Edge-of-Stability

A canonical EoS trajectory unfolds in discernible phases (not all present in every model):

Phase I: Progressive Sharpening — Early iterations see sharpness increase monotonically toward the stability threshold, driven by gradient flow-like dynamics (Li et al., 2022, Liu et al., 4 Mar 2025). Output layer scaling and data-covariance properties tightly predict this upward movement.
Phase II: Instability/Transition — On reaching and marginally exceeding $2/\eta$ , the trajectory exhibits transient overshoots and reversal events in the leading curvature direction, causing small but distinct increases in loss and norm adjustments in network weights (Zhu et al., 2022, Li et al., 2022).
Phase III: Self-Stabilization — Oscillatory interactions between top modes, output scaling, and geometrical alignment force sharpness to hover near the threshold, with the trajectory prevented from persistent instability via dynamic readjustments (Litman, 22 Apr 2026, Liu et al., 4 Mar 2025).
Cycle Repeats — Over training, these oscillatory cycles repeat, underpinning slow but steady optimization progress despite formal local instability.
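
The overshoot and self-stabilization phases can be reproduced on a two-parameter toy loss. The sketch below assumes $f(x, y) = \tfrac{1}{2}x^2y^2$, a standard minimal example chosen here for illustration and not the model analyzed in the cited works. Its minimizers lie on the line $x = 0$, where the sharpness equals $y^2$. It does not show progressive sharpening: the run is deliberately started above the threshold.

```python
import math


def sharpness(x, y):
    """Largest Hessian eigenvalue of f(x, y) = 0.5 * x**2 * y**2 (closed form)."""
    # Hessian = [[y^2, 2xy], [2xy, x^2]]; trace = x^2 + y^2, det = -3 x^2 y^2.
    tr = x * x + y * y
    det = -3.0 * x * x * y * y
    return 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))


def run_gd(eta, x, y, steps):
    """Plain GD on f; returns per-step loss and sharpness lists, initial point included."""
    losses, sharps = [], []
    for _ in range(steps + 1):
        losses.append(0.5 * x * x * y * y)
        sharps.append(sharpness(x, y))
        gx, gy = x * y * y, x * x * y      # gradient of f
        x, y = x - eta * gx, y - eta * gy  # simultaneous update of both parameters
    return losses, sharps


eta = 0.1
losses, sharps = run_gd(eta, x=0.01, y=5.0, steps=200)

print("threshold 2/eta  :", 2.0 / eta)
print("initial sharpness:", sharps[0])
print("final sharpness  :", sharps[-1])
print("peak loss        :", max(losses))
print("final loss       :", losses[-1])
```

With these settings the initial sharpness is about 25, above $2/\eta = 20$. The coordinate $x$ oscillates with growing amplitude, the loss rises temporarily, and the growth of $x$ shrinks $y$ until the sharpness falls below the threshold, after which $x$ decays and the loss goes to zero.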

The EoS is robust to the presence of nonconvexity, high depth, overparameterization, and intricate neural architectures (Ghosh et al., 27 Feb 2025, Gan, 3 Apr 2026, Zhang et al., 2024).

3. Model-Specific Instantiations and Extensions

EoS appears across a broad spectrum of optimizers and architectures, each with unique stability signatures:

Setting	Sharpness Constraint	Regularized Quantity	Reference
Full-batch GD	$\lambda_{\max}(H) \rightarrow 2/\eta$	Top Hessian eigenvalue	(Arora et al., 2022, Litman, 22 Apr 2026)
SGD (mini-batch)	$\text{Batch Sharpness} \rightarrow 2/\eta$	Expected mini-batch directional curvature	(Andreyev et al., 2024, Andreyev et al., 15 Apr 2026)
Adam, Adagrad	$\lambda_{\max}(P^{-1} H)\rightarrow \text{threshold}$	Preconditioned Hessian eigenvalue	(2207.14484)
Zeroth-Order (ZO)	$\operatorname{Tr}(H)\rightarrow 2/\eta$	Hessian trace	(Song et al., 16 Apr 2026)

For momentum and stochastic cases, sharpness plateaus interpolate between a lower and a higher threshold depending on batch size and noise, delineating deterministic and noise-dominated EoS (Andreyev et al., 15 Apr 2026).
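
To make the regularized quantities in the table concrete, the sketch below computes three of them for a small explicit matrix: $\lambda_{\max}(H)$, $\lambda_{\max}(P^{-1}H)$ for a diagonal preconditioner, and $\operatorname{Tr}(H)$. The matrix `H`, the preconditioner `p`, and the helper names are assumptions made for illustration. Batch sharpness is omitted because it requires per-batch gradients.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def top_eigenvalue(A, iters=200):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix A."""
    v = [1.0] * len(A)
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    return sum(x * y for x, y in zip(v, Av))   # Rayleigh quotient of a unit vector


H = [[4.0, 1.0], [1.0, 3.0]]   # toy Hessian (assumed, symmetric positive definite)
p = [4.0, 1.0]                 # toy diagonal preconditioner P = diag(p) (assumed)
n = len(H)

# Full-batch GD row: largest Hessian eigenvalue.
lam_max = top_eigenvalue(H)

# Adaptive row: lambda_max(P^{-1} H), computed on the similar symmetric matrix
# P^{-1/2} H P^{-1/2}, which has the same eigenvalues.
H_pre = [[H[i][j] / math.sqrt(p[i] * p[j]) for j in range(n)] for i in range(n)]
lam_pre = top_eigenvalue(H_pre)

# Zeroth-order row: Hessian trace.
trace = sum(H[i][i] for i in range(n))

print("lambda_max(H)       :", lam_max)
print("lambda_max(P^-1 H)  :", lam_pre)
print("Tr(H)               :", trace)
```

For this matrix the three quantities are roughly 4.618, 3.118, and 7, so the same landscape sits at a different distance from the edge depending on which quantity the optimizer constrains.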

In kernel (NTK) or deep linear matrix factorization models, EoS governs both eigenvalue spectra and eigenvector evolution, with feature learning and target alignment dynamics tightly controlled by sharpness cycles (Jiang et al., 17 Jul 2025, Ghosh et al., 27 Feb 2025).

4. Analytical Mechanisms and Bifurcations

EoS is fundamentally linked to the spectral and geometric structure of the loss landscape and the discrete-time update map:

Edge coupling functional: A symmetric “action” on consecutive iterate pairs, with criticality set through the step size $\eta$ , organizes all dynamics. Differencing its optimality condition yields a recurrence whose linear stability boundary is $\lambda_{\max} = 2/\eta$ (Litman, 22 Apr 2026).
Bifurcation theory: Nonlinear period-doubling and center manifold expansions reveal that, above threshold, training settles into one- or multi-dimensional periodic orbits (period-2 cycles) pinning sharpness at or just below $2/\eta$ . The side and nature of bifurcations depend on higher-order derivatives and product-stability conditions of the loss (Chen et al., 2022, Zhu et al., 2022, Gan, 3 Apr 2026).
Mean-value localization: Taylor and telescoping arguments localize observed sharpness oscillations to actual Hessian values at interior points along the GD step, yielding exact edge-forcing of the true spectrum (Litman, 22 Apr 2026).
Non-Euclidean and adaptive geometries: EoS generalizes to arbitrary norm geometries and adaptive algorithms via a generalized, norm-dependent sharpness quantity, with the GD stability edge universally at $2/\eta$ (Islamov et al., 5 Mar 2026, 2207.14484).
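
The linear part of these mechanisms can be made concrete with the standard textbook calculation for gradient descent on a quadratic. This is a sketch assuming a fixed symmetric Hessian $H$ with eigenpairs $(\lambda_i, v_i)$ and a minimizer $w^\ast$ (notation introduced here for illustration). It is not the edge coupling derivation of the cited work.

$$
\begin{aligned}
L(w) &= \tfrac{1}{2}(w - w^\ast)^\top H\,(w - w^\ast), \\
w_{t+1} - w^\ast &= (I - \eta H)(w_t - w^\ast), \\
c_i^{(t+1)} &= (1 - \eta\lambda_i)\, c_i^{(t)}, \qquad c_i^{(t)} = v_i^\top (w_t - w^\ast).
\end{aligned}
$$

Each mode contracts exactly when $|1 - \eta\lambda_i| < 1$, that is, when $0 < \eta\lambda_i < 2$, so the binding constraint is $\lambda_{\max} < 2/\eta$. At $\eta\lambda_{\max} = 2$ the multiplier of the top mode equals $-1$: the iterate flips sign every step with constant amplitude. This is the linear signature of a period-2 orbit, and it marks the point at which a period-doubling bifurcation can occur once nonlinear terms are included.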

5. Implicit Regularization and Generalization

Operating at the EoS induces strong implicit biases in model selection:

Classical GD: Large $\eta$ drives the model into regions of low top Hessian eigenvalue, biasing toward flatter minima in the direction of sharpest curvature (Arora et al., 2022, Liu et al., 4 Mar 2025).
ZO methods: The mean-square EoS places the constraint on $\operatorname{Tr}(H)$ , the trace of the Hessian, hence regularization acts on the “bulk” of the spectrum and not merely its edge (Song et al., 16 Apr 2026).
Variational/injective frameworks: By dynamically tuning the effective EoS threshold, methods like variational learning force diffusion toward lower-sharpness minima and enable controlled generalization improvements (Ghosh et al., 15 Jun 2025).
SGD and the Edge of Stochastic Stability (EoSS): Small batch sizes inflate mini-batch sharpness and, through a convexity argument (a Jensen gap), push the full-batch sharpness to flatter regions than full-batch GD would reach; empirically this is tightly coupled to improved generalization (Andreyev et al., 2024, Tuci et al., 21 Apr 2026).
Sharpness Dimension: Generalization in the EoS regime correlates with the fractal dimension of the long-run attractor set generated by the optimizer dynamics, which, under EoS, is often substantially lower than the ambient parameter dimension (Tuci et al., 21 Apr 2026).
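
The convexity argument behind the mini-batch effect can be checked numerically. Because $\lambda_{\max}$ is a convex function of a symmetric matrix, the average of $\lambda_{\max}$ over mini-batch Hessians is at least $\lambda_{\max}$ of the full-batch Hessian. The sketch below uses $\lambda_{\max}$ of the mini-batch Hessian as a simple stand-in. The batch sharpness of the cited works is a directional quantity and may be defined differently. The toy data and helper names are assumptions.

```python
import math
from itertools import combinations


def lam_max_2x2(a, b, d):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d + math.sqrt((a - d) ** 2 + 4.0 * b * b))


def batch_hessian_sharpness(batch):
    """lambda_max of the mean per-sample Hessian x x^T over a batch of 2-D inputs."""
    k = len(batch)
    a = sum(x0 * x0 for x0, _ in batch) / k
    b = sum(x0 * x1 for x0, x1 in batch) / k
    d = sum(x1 * x1 for _, x1 in batch) / k
    return lam_max_2x2(a, b, d)


# Toy inputs (assumed). For the loss 0.5 * (w . x - y)^2 the per-sample Hessian is x x^T.
data = [(2.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]

expected = {}
for size in (1, 2, 4):
    batches = list(combinations(data, size))   # every batch of this size, equally weighted
    expected[size] = sum(batch_hessian_sharpness(B) for B in batches) / len(batches)

for size, value in expected.items():
    print(f"batch size {size}: mean lambda_max = {value:.4f}")
```

The mean mini-batch value shrinks toward the full-batch value as the batch grows. A training run that pins the mini-batch quantity at a fixed threshold therefore holds the full-batch sharpness below that threshold, with a larger gap for smaller batches.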

6. Extensions, Singular Cases, and Limitations

Several advanced and singular regimes have been mathematically characterized:

Overparameterized quadratics: Even with quadratic loss, if parameterization is nonlinear (e.g., depth-2 diagonal nets), EoS arises and leads to structured oscillatory convergence and bias away from minimum-norm interpolants (Zhang et al., 2024).
Loss-specific phenomena: With the logistic loss, but not the exponential loss, GD on linearly separable data retains global convergence and the classical max-margin bias for arbitrary constant step sizes, including those beyond the stability threshold, which highlights loss-dependent mechanisms (Wu et al., 2023).
PDE and numerical viewpoints: EoS coincides with bounded, restrained numerical instabilities in gradient-flow PDEs for deep nets, where nonlinearity dynamically suppresses instability growth, explaining why divergence does not occur even when classical conditions are violated (Sun et al., 2022).
Differential Privacy: DP noise and clipping slow or inhibit the attainment of EoS, lowering sharpness plateaus and biasing toward even flatter minima; practical schedules must account for this in tuning learning rate and privacy parameters (Hussain et al., 22 Dec 2025).
Limitations: EoS theory for multidimensional, non-factorized, highly nonlinear, or dynamic-learning-rate settings is still evolving. Local analysis often does not fully explain the global approach to EoS from random initialization, and convergence guarantees may be limited to neighborhoods of stable fixed points or cycles (Gan, 3 Apr 2026, Zhu et al., 2022).

7. Practical Considerations and Applications

EoS governs both optimization efficiency and generalization trade-offs, with significant repercussions for practical deep learning:

Step size tuning: Efficient convergence often requires pushing $\eta$ up to, but not far beyond, the EoS threshold; excessive overshoot can induce prolonged instability or slow convergence along flat directions (Li et al., 2022, Liu et al., 4 Mar 2025).
Batch size and momentum selection: Batch sharpness monitoring and adaptive schedule interventions can safely maintain the system at the EoSS or EoS plateau, maximizing speed while avoiding divergence (Andreyev et al., 15 Apr 2026, Andreyev et al., 2024).
Regularization and model robustness: Mechanisms such as weight decay, architectural choices, variational “temperature”, noise, and gradient clipping modify the effective EoS threshold and thereby allow explicit control over the trajectory’s implicit bias and its exploration of flatter regions (Ghosh et al., 15 Jun 2025, Hussain et al., 22 Dec 2025).
Feature learning and NTK adaptation: Periodic sharpness cycles induce rotation and realignment of leading kernel eigenvectors with targets, undergirding feature learning properties beyond the lazy NTK regime (Jiang et al., 17 Jul 2025).
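
Monitoring of the kind described in this list needs a sharpness estimate that does not form the Hessian. One common approach, sketched here under stated assumptions, is power iteration on Hessian-vector products obtained from central differences of the gradient. The names `hvp`, `estimate_sharpness`, and `eos_ratio`, the step `eps`, and the toy loss are all hypothetical. A real training loop would typically use automatic differentiation for the Hessian-vector product.

```python
import math


def hvp(grad_fn, w, v, eps=1e-3):
    """Hessian-vector product via central differences of the gradient."""
    plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def estimate_sharpness(grad_fn, w, iters=100):
    """Power iteration on Hessian-vector products; returns a Rayleigh quotient.

    This finds the eigenvalue of largest magnitude, which equals lambda_max
    only when no negative eigenvalue dominates.
    """
    v = [1.0 / math.sqrt(len(w))] * len(w)       # fixed unit start vector
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return sum(a * b for a, b in zip(v, hvp(grad_fn, w, v)))


def eos_ratio(grad_fn, w, eta):
    """eta * sharpness; values near 2 indicate operation at the GD stability edge."""
    return eta * estimate_sharpness(grad_fn, w)


# Toy loss (assumed): L(w) = 0.5 * (w0^2 + 10 * w1^2), with Hessian diag(1, 10).
def toy_grad(w):
    return [1.0 * w[0], 10.0 * w[1]]


ratio = eos_ratio(toy_grad, w=[0.3, -0.7], eta=0.19)
print("eta * lambda_max ~", ratio)
```

Logging this ratio during training shows whether a run is still in the progressive-sharpening phase (ratio below 2 and rising) or has reached the plateau (ratio hovering around 2).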

Taken together, the Edge-of-Stability regime offers a unifying account of why large learning rates can yield both efficient convergence and robust generalization in modern deep learning optimization: the optimizer, network, and data geometry jointly induce self-regulation at sharpness thresholds dictated by the training protocol, dynamically balancing local instabilities and global descent (Litman, 22 Apr 2026, Arora et al., 2022, Tuci et al., 21 Apr 2026, Islamov et al., 5 Mar 2026, Andreyev et al., 2024, 2207.14484, Song et al., 16 Apr 2026).