Edge of Stability (EoS) Regime

Updated 22 April 2026
  • Edge of Stability (EoS) is a regime in deep learning optimization where gradient descent operates beyond classical stability thresholds, exhibiting non-monotonic and oscillatory loss dynamics.
  • The theoretical framework reveals that product-stability conditions and period-2 bifurcations enable convergence even when maximal Hessian eigenvalues exceed traditional bounds.
  • Empirical studies across various architectures and optimizers confirm that EoS dynamics facilitate aggressive training schedules and implicit regularization.

The Edge of Stability (EoS) regime describes an empirically robust and theoretically non-classical phase of gradient descent-based optimization in modern deep learning, characterized by the persistent violation of the classical local stability threshold for convergence. In this regime, the “sharpness” (i.e., the maximal Hessian eigenvalue of the training loss) exceeds the classical bound predicted under quadratic approximation, yet the loss continues to decrease over long time scales via non-monotonic and oscillatory dynamics. EoS has been observed universally across architectures and optimization setups, and recent works provide a quantitative theory for conditions under which EoS-based convergence is guaranteed, unifying a growing literature on this phenomenon (Gan, 3 Apr 2026).

1. Classical Stability, Definition, and Empirical Manifestation

The classical stability condition for gradient descent (GD) states that, for a smooth loss $L(\theta)$, descent is guaranteed provided $\eta \cdot \lambda_{\max}(\nabla^2 L(\theta)) < 2$, where $\eta$ is the step size and $\lambda_{\max}$ denotes the largest eigenvalue of the Hessian at $\theta$. When the product exceeds 2, the local quadratic approximation predicts divergence.
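
As a minimal illustration (a hypothetical toy, not taken from the cited papers), the sketch below estimates $\lambda_{\max}$ by power iteration on finite-difference Hessian-vector products and checks the classical product $\eta \cdot \lambda_{\max}$ against 2; the quadratic loss, helper names, and constants are all placeholders.

```python
import numpy as np

# Minimal sketch (hypothetical toy, not from the cited papers): estimate the sharpness
# lambda_max(Hessian) by power iteration on finite-difference Hessian-vector products,
# then check the classical stability product eta * lambda_max against 2.

def hvp(grad_fn, theta, v, eps=1e-4):
    """Finite-difference Hessian-vector product: H v ~ (g(theta + eps v) - g(theta - eps v)) / (2 eps)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def sharpness(grad_fn, theta, n_iter=100, seed=0):
    """Estimate lambda_max of the Hessian at theta by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        Hv = hvp(grad_fn, theta, v)
        lam = float(v @ Hv)                      # Rayleigh quotient estimate
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
    return lam

# Hypothetical quadratic loss L(theta) = 0.5 * theta^T A theta, so the true sharpness is 3.
A = np.diag([3.0, 1.0, 0.1])
grad_fn = lambda th: A @ th
theta = np.ones(3)

eta = 0.5
lam_max = sharpness(grad_fn, theta)
print(f"eta * lambda_max = {eta * lam_max:.3f}",
      "(classically stable)" if eta * lam_max < 2 else "(beyond the classical threshold)")
```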

However, in deep learning practice, sharpness not only grows to approach this threshold but typically overshoots and oscillates around $\lambda_{\max} \approx 2/\eta$. Despite periods where $\eta \, \lambda_{\max} > 2$, the loss continues to descend over long horizons, albeit non-monotonically (Cohen et al., 2021). This phenomenon, termed the Edge of Stability (EoS), is widely observed in full-batch GD, stochastic variants, and even adaptive optimization methods, and it is robust across architectures and datasets (Li et al., 2022, 2207.14484, Andreyev et al., 15 Apr 2026).
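
The qualitative pattern is easy to reproduce on a toy problem. The sketch below (illustrative constants, not an experiment from the cited works) runs full-batch GD on the scalar factorized loss $L(x, y) = \tfrac{1}{2}(xy - 1)^2$ with a step size for which $\eta \, \lambda_{\max}$ starts above 2: the loss spikes and oscillates for a few steps, then drops and the iterates converge.

```python
import numpy as np

# Illustrative toy run (constants chosen by hand, not an experiment from the cited works):
# full-batch GD on L(x, y) = 0.5 * (x*y - 1)^2 with a step size for which eta * lambda_max
# starts above 2. The loss spikes and oscillates for a few steps, then drops and converges.

def loss(x, y):
    return 0.5 * (x * y - 1.0) ** 2

def sharpness(x, y):
    """Largest eigenvalue of the 2x2 Hessian of L at (x, y)."""
    H = np.array([[y * y, 2 * x * y - 1.0],
                  [2 * x * y - 1.0, x * x]])
    return np.linalg.eigvalsh(H)[-1]

eta = 0.6                  # illustrative step size
x, y = 2.0, 0.3            # illustrative imbalanced initialization
for t in range(40):
    print(f"t={t:2d}  loss={loss(x, y):8.4f}  eta*sharpness={eta * sharpness(x, y):6.3f}")
    r = x * y - 1.0                            # residual
    x, y = x - eta * r * y, y - eta * r * x    # simultaneous GD update
```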

Definitionally, EoS is the regime in which the optimization trajectory persists despite repeated violations of the classical quadratic stability bound, with oscillatory loss and sharpness dynamics (Gan, 3 Apr 2026).

2. Theoretical Mechanisms: Product-Stability and Convergence Conditions

Recent advances extend the theoretical foundation of EoS beyond ad-hoc or restricted cases to a broader class of losses and model parameterizations (Gan, 3 Apr 2026). The central result is that, for factorizable objectives of the form $L(x, y) = l(xy)$ with $l \in C^5$, convergence in EoS is restored when the loss is “product-stable”, a structural property defined by $\alpha_l(z_*) = 3[l^{(3)}(z_*)]^2 - l^{(4)}(z_*)\, l''(z_*) > 0$, where $z_*$ is a local minimum of $l$. This product-stability criterion is generic; it encompasses logistic and binary cross-entropy loss, multilayer squared loss, and their generalizations.
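
The criterion is directly checkable from local higher-order derivatives. The sketch below estimates $\alpha_l(z_*)$ with finite differences; the example loss is hypothetical, chosen only so that the derivatives are easy to verify by hand.

```python
import math
import numpy as np

# Numerical check of the product-stability indicator (illustrative; the example loss below
# is hypothetical and chosen only so that the derivatives are easy to verify by hand).

def nth_derivative(f, z, n, h=1e-2):
    """Central finite-difference estimate of the n-th derivative of f at z."""
    k = np.arange(n + 1)
    coeff = (-1.0) ** k * np.array([math.comb(n, int(i)) for i in k])
    return float(np.sum(coeff * f(z + (n / 2 - k) * h)) / h ** n)

def product_stability_alpha(l, z_star):
    """alpha_l(z*) = 3 [l'''(z*)]^2 - l''''(z*) l''(z*); a positive value means product-stable."""
    d2 = nth_derivative(l, z_star, 2)
    d3 = nth_derivative(l, z_star, 3)
    d4 = nth_derivative(l, z_star, 4)
    return 3 * d3 ** 2 - d4 * d2

# Hypothetical smooth scalar loss with a local minimum at z* = 0: l(z) = 0.5 * (exp(z) - 1)^2.
# Exact derivatives at 0 are l'' = 1, l''' = 3, l'''' = 7, so alpha = 3 * 9 - 7 * 1 = 20 > 0.
l = lambda z: 0.5 * (np.exp(z) - 1.0) ** 2
print(product_stability_alpha(l, 0.0))   # approximately 20, so the criterion holds here
```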

In such cases, even when $\eta \, \lambda_{\max}(\nabla^2 L(\theta)) > 2$ (hyper-sharpness), GD exhibits two-step dynamics with period-2 oscillatory orbits (two-step fixed points) that are locally attracting. Convergence is then guaranteed for initializations in an explicitly characterized neighborhood, under a corresponding condition on the step size, and the sharpness at convergence is given in closed form, quantifying precisely how far it settles below the $2/\eta$ threshold (Gan, 3 Apr 2026).
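
A one-dimensional toy (not the factorized setting analyzed in (Gan, 3 Apr 2026), only a minimal illustration of an attracting two-step fixed point) shows the same mechanism: around a minimum whose curvature violates the classical bound, GD settles into a stable period-2 orbit rather than diverging.

```python
# Minimal one-dimensional illustration (a hand-picked toy, not the factorized setting of
# the cited paper): for l(x) = (x^2 - 1)^2 / 4 the minimum at x = 1 has curvature
# l''(1) = 2, so eta = 1.1 gives eta * l''(1) = 2.2 > 2 and the minimum is classically
# unstable. GD does not diverge; it settles into a locally attracting period-2 orbit.

eta = 1.1
x = 0.8
traj = []
for t in range(200):
    x = x - eta * (x ** 3 - x)        # GD step on l(x) = (x^2 - 1)^2 / 4
    traj.append(x)

a, b = traj[-2], traj[-1]
print(f"late iterates alternate between {a:.4f} and {b:.4f}")
mult = (1 - eta * (3 * a * a - 1)) * (1 - eta * (3 * b * b - 1))
print(f"two-step map multiplier: {mult:+.3f}  (magnitude below 1 means the orbit attracts)")
```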

The three-phase convergence dynamics comprise (I) a rapid approach to the period-2 manifold, (II) a slow drift as the oscillation amplitude shrinks, and (III) an eventual transition into the classical stability regime, where convergence becomes linear.

3. Dynamics, Bifurcations, and Periodic Orbits in EoS

Beyond scalar factorized settings, EoS dynamics have been analyzed through explicit construction and analysis of period-2 orbits in higher dimensions and over-parameterized models (Chen et al., 2022, Ghosh et al., 27 Feb 2025, Zhu et al., 2022). In deep linear networks (DLN), the period-doubling bifurcation structure is analytically characterized: only the top singular modes whose curvature exceeds the stability threshold participate in oscillatory cycles, and the “oscillation rank” of the dynamics is exactly the number of such super-threshold modes (Ghosh et al., 27 Feb 2025). This restricts loss oscillations to a subspace aligned with data-induced features.
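
As a crude numerical proxy for this counting argument (illustrative only; the cited analysis works with the singular modes of the deep linear network rather than raw Hessian eigenvalues), one can count how many Hessian eigenvalues at the current parameters exceed $2/\eta$:

```python
import numpy as np

# Crude proxy (illustrative only; the cited analysis uses the singular-mode structure of
# deep linear networks, not raw Hessian eigenvalues): for a tiny two-layer linear model
# with loss L = 0.5 * ||W2 W1 X - Y||_F^2, count how many Hessian eigenvalues exceed
# 2 / eta. Only such super-threshold modes can sustain period-2 oscillations, and their
# number is the quantity the text above calls the oscillation rank.

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 3, 3, 2, 20
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, n))
params = rng.standard_normal(d_out * d_hidden + d_hidden * d_in)   # [W2 | W1], flattened

def loss(p):
    W2 = p[:d_out * d_hidden].reshape(d_out, d_hidden)
    W1 = p[d_out * d_hidden:].reshape(d_hidden, d_in)
    return 0.5 * np.sum((W2 @ W1 @ X - Y) ** 2)

def hessian(p, h=1e-4):
    """Dense Hessian by central finite differences (fine for a handful of parameters)."""
    m = p.size
    H = np.zeros((m, m))
    eye = np.eye(m)
    for i in range(m):
        for j in range(m):
            H[i, j] = (loss(p + h * (eye[i] + eye[j])) - loss(p + h * (eye[i] - eye[j]))
                       - loss(p - h * (eye[i] - eye[j])) + loss(p - h * (eye[i] + eye[j]))) / (4 * h * h)
    return H

eta = 0.05
eigs = np.linalg.eigvalsh(hessian(params))
print("Hessian eigenvalues above 2/eta (candidate oscillation rank):", int(np.sum(eigs > 2 / eta)))
```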

Minimalist nonconvex toy models and deep networks both reveal global bifurcations: period-doubling transitions, chaos, and even fractal basin boundaries for initializations, confirming the genericity and complexity of EoS attractors (Zhu et al., 2022, Liu et al., 4 Mar 2025). In these systems, EoS is marked by sharpness hovering just below $2/\eta$ and non-monotonic yet convergent loss.
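
The bifurcation structure can be explored numerically by sweeping the step size of the GD map on a one-dimensional nonconvex loss and recording its long-run iterates. The sketch below (the same toy quartic as above, with hand-picked constants) counts distinct long-run values: one value indicates convergence, two a period-2 orbit, many a longer cycle or chaos.

```python
import numpy as np

# Illustrative bifurcation scan for the GD map x -> x - eta * l'(x) on the nonconvex
# loss l(x) = (x^2 - 1)^2 / 4, so l'(x) = x^3 - x. For each eta we discard a burn-in
# and record the remaining iterates: one distinct value means convergence, two a
# period-2 orbit, many a longer cycle or chaos.

def gd_map(x, eta):
    return x - eta * (x ** 3 - x)

for eta in np.arange(0.85, 1.65, 0.1):
    x = 0.8                                  # fixed illustrative initialization
    for _ in range(500):                     # burn-in
        x = gd_map(x, eta)
    orbit = set()
    for _ in range(200):
        x = gd_map(x, eta)
        orbit.add(round(x, 4))
    print(f"eta={eta:4.2f}  distinct long-run values: {len(orbit)}")
```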

4. Generalization, Attractor Dimension, and Implicit Regularization

Stochastic (mini-batch) variants and momentum-based optimizers (e.g., SGD with momentum, Adam) exhibit analogous phenomena: the relevant stability measure is the batch sharpness rather than the full-batch Hessian sharpness, and it hovers at optimizer-specific thresholds that depend on the step size, the batch size, and the momentum parameter (Andreyev et al., 15 Apr 2026, Andreyev et al., 2024). Adaptive methods shift the threshold further; for Adam, the preconditioned sharpness equilibrates at a level set by the step size and its momentum hyperparameters (2207.14484).
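
For intuition about how momentum moves the boundary, the classical heavy-ball analysis on a one-dimensional quadratic (standard background, not a result from the cited papers) predicts stability exactly when $\eta\lambda < 2(1+\beta)$ for momentum $0 \le \beta < 1$; the sketch below checks this prediction numerically on either side of the boundary.

```python
# Background check (classical heavy-ball analysis on a 1-D quadratic, not taken from the
# cited papers): for l(x) = 0.5 * lam * x^2, gradient descent with momentum beta is stable
# iff eta * lam < 2 * (1 + beta), so momentum pushes the boundary above the plain-GD 2/eta.

def diverges(lam, eta, beta, steps=2000):
    x_prev, x = 1.0, 1.0
    for _ in range(steps):
        x, x_prev = x - eta * lam * x + beta * (x - x_prev), x
        if abs(x) > 1e6:
            return True
    return False

eta, beta = 0.1, 0.9
boundary = 2 * (1 + beta) / eta          # predicted threshold on lam: 38.0 here
for lam in (0.9 * boundary, 1.1 * boundary):
    print(f"lam={lam:5.1f}  diverges={diverges(lam, eta, beta)}")
```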

The EoS regime induces dynamics on a low-dimensional, possibly fractal pullback attractor, rather than classical pointwise convergence. This attractor's dimension—the sharpness dimension, derived from Lyapunov exponents of the optimizer’s random dynamical system—upper-bounds the generalization gap, outperforming conventional flatness or norm-based complexity measures in the chaotic EoS regime (Tuci et al., 21 Apr 2026).
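
For a one-dimensional GD map the (single) Lyapunov exponent reduces to the running average of $\log|1 - \eta\, l''(x_t)|$ along the trajectory. The toy estimate below (illustrative, and far simpler than the random-dynamical-system construction in the cited work) is negative for convergent or periodic dynamics and positive in the chaotic regime.

```python
import numpy as np

# Toy estimate (illustrative; not the construction from the cited paper): the Lyapunov
# exponent of the 1-D GD map x -> x - eta * l'(x) is the long-run average of
# log|1 - eta * l''(x_t)|. Negative values indicate convergence to a point or a stable
# periodic orbit; positive values indicate chaotic EoS dynamics.

def lyapunov_exponent(eta, x0=0.8, burn_in=500, steps=5000):
    d_l = lambda x: x ** 3 - x           # l(x) = (x^2 - 1)^2 / 4
    dd_l = lambda x: 3 * x ** 2 - 1
    x = x0
    for _ in range(burn_in):
        x = x - eta * d_l(x)
    acc = 0.0
    for _ in range(steps):
        acc += np.log(abs(1 - eta * dd_l(x)) + 1e-30)
        x = x - eta * d_l(x)
    return acc / steps

for eta in (0.8, 1.1, 1.5):
    print(f"eta={eta:3.1f}  lyapunov exponent ~ {lyapunov_exponent(eta):+.3f}")
```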

Oscillatory EoS dynamics induce an implicit regularizer on the solution: deterministic GD with a large step size drifts along the loss manifold toward regions of lower sharpness, dynamically biasing optimization toward flatter minima (Arora et al., 2022, Jiang et al., 17 Jul 2025). In variational learning, EoS analysis shows that even sharper flatness thresholds can be achieved, with plateau values tuned through the posterior covariance (Ghosh et al., 15 Jun 2025).
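
This flatness bias can be seen in miniature on the scalar factorized loss $L(x, y) = \tfrac{1}{2}(xy - 1)^2$ used above, whose minima form the curve $xy = 1$ with sharpness $x^2 + y^2$ at a minimum. In the sketch below (hand-picked illustrative constants, not from the cited papers), GD started near a minimum sharper than $2/\eta$ oscillates and migrates along the valley until its sharpness falls below the threshold.

```python
import numpy as np

# Illustrative sketch of the flatness bias (hand-picked constants, not from the cited
# papers): on L(x, y) = 0.5 * (x*y - 1)^2 the minima form the curve x*y = 1, and the
# sharpness at a minimum equals x^2 + y^2. Starting near a minimum whose sharpness
# slightly exceeds 2/eta, large-step GD oscillates and drifts along the valley until
# its sharpness falls below 2/eta, i.e. toward a flatter minimum.

def sharpness(x, y):
    H = np.array([[y * y, 2 * x * y - 1.0], [2 * x * y - 1.0, x * x]])
    return np.linalg.eigvalsh(H)[-1]

eta = 0.48
x, y = 2.0, 0.51          # near a minimum with sharpness about 4.27 > 2/eta = 4.167
print(f"initial sharpness = {sharpness(x, y):.3f},   2/eta = {2 / eta:.3f}")
for _ in range(400):
    r = x * y - 1.0
    x, y = x - eta * r * y, y - eta * r * x
print(f"final sharpness   = {sharpness(x, y):.3f}   (below 2/eta: flatter than at the start)")
```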

5. Examples Across Architectures, Losses, and Optimization Variants

Table: Product-Stable Losses (excerpt from (Gan, 3 Apr 2026))

Loss Function | Product-Stable ($\alpha_l(z_*) > 0$)?
Binary Cross-Entropy (BCE) | Yes
Multilayer Squared Loss | Yes
Subquadratic (“good regularity”) losses | Yes

Empirical studies confirm EoS in feedforward networks, CNNs, Vision Transformers, and even recurrence-based models (e.g., Edge-of-Stability Echo State Networks (Ceni et al., 2023)). All exhibit either persistent or eventual sharpness plateauing at the regime-specific stability boundary.

Optimization with differential privacy modulates and often dampens EoS oscillations: gradient clipping and noise shift the effective plateau downward, especially under strong privacy, yet EoS-like dynamics and boundary behaviors remain (Hussain et al., 22 Dec 2025).

6. Practical Implications and Broader Significance

EoS provides a principled explanation for large-learning-rate training and sharpness-aware regularization, and it informs the construction of step-size and batch-size schedules that maximize optimization speed without sacrificing stability (Gan, 3 Apr 2026, Andreyev et al., 15 Apr 2026). Product-stability unifies previously isolated sufficient conditions for EoS convergence (e.g., subquadratic losses, degree-of-regularity assumptions, specific minimum structure) under a single checkable criterion based on local higher-order derivatives.

The EoS framework quantitatively predicts when aggressive training schedules will succeed, and how meta-parameters (momentum, batch size, preconditioning, posterior covariance) shift the stability boundary in practical optimization algorithms.
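
One way such predictions can inform practice is a sharpness-aware step-size rule. The sketch below is a hypothetical heuristic inspired by this discussion, not an algorithm from the cited papers: periodically re-estimate the sharpness and rescale $\eta$ so that $\eta \, \lambda_{\max}$ tracks a chosen target (a target below 2 stays classically stable; a target slightly above 2 deliberately operates at the edge of stability).

```python
import numpy as np

# Hypothetical heuristic inspired by the discussion above (not an algorithm from the
# cited papers): periodically re-estimate sharpness and rescale the step size so that
# eta * lambda_max stays near a chosen target. target < 2 keeps GD classically stable,
# while a target slightly above 2 deliberately operates in the EoS regime.

def sharpness_aware_gd(grad_fn, hess_fn, theta, steps=200, target=1.8, recal_every=10):
    eta = 1e-2
    for t in range(steps):
        if t % recal_every == 0:
            lam = np.linalg.eigvalsh(hess_fn(theta))[-1]   # current sharpness
            eta = target / max(lam, 1e-8)                   # keep eta * lambda_max near target
        theta = theta - eta * grad_fn(theta)
    return theta

# Toy quadratic, only to exercise the loop: L(theta) = 0.5 * theta^T A theta.
A = np.diag([10.0, 1.0, 0.1])
theta = sharpness_aware_gd(lambda th: A @ th, lambda th: A, np.ones(3))
print(theta)   # driven toward the minimizer at the origin
```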

7. Limitations, Open Questions, and Current Frontiers

Current EoS theory addresses deterministic and mini-batch stochastic optimization; extending it fully to high-noise stochastic settings, highly non-smooth or non-standard architectures, and further adaptive or blockwise optimizers remains open. In deep nonlinear, over-parameterized regimes, generic conditions for global stability, the scaling of oscillatory subspaces, and the emergence of chaos are under active investigation (Ghosh et al., 27 Feb 2025, Tuci et al., 21 Apr 2026).

Extensions to non-Euclidean metrics show that EoS persists with a generalized sharpness measure dictated by the optimizer's geometry, encompassing a broad family of preconditioned, blockwise, and norm-adaptive methods (Islamov et al., 5 Mar 2026). Quantitative diagnostics, such as tracking the relevant sharpness measure against its stability threshold, now permit empirical prediction and control of EoS behavior across diverse architectures.

EoS is thus established as a generic, theoretically principled regime of modern deep learning optimization, unifying disparate observations, furnishing design principles for new optimizers, and reshaping understanding of stability, convergence, and generalization throughout the field (Gan, 3 Apr 2026, Tuci et al., 21 Apr 2026).
