The Origin of Edge of Stability

Published 22 Apr 2026 in cs.LG and stat.ML | (2604.20446v1)

Abstract: Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/η$, where $η$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/η$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/η$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/η$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.


Authors (1)

Elon Litman

Summary

The paper introduces a variational edge coupling functional that explains how full-batch gradient descent trajectories are forced to maintain an effective curvature of 2/η.
It derives precise step recurrences and conservation laws that account for the observed oscillatory behavior and period-doubling bifurcations at critical learning rates.
The analysis extends to mini-batch SGD, offering theoretical insights into stability, curvature forcing, and the universal dynamics underlying neural network training.

Unified Variational Foundation for the Edge of Stability in Full-Batch Gradient Descent

Edge of Stability: Empirical Phenomena and Prior Theory

The Edge of Stability (EoS) characterizes the long-term dynamics of full-batch gradient descent (GD) on neural networks: the sharpness (the spectral norm of the Hessian) rises during training and eventually saturates at the threshold $2/\eta$, with $\eta$ the learning rate. Once at this threshold, the loss generally oscillates locally while continuing to descend over longer timescales. This phenomenon is robust across a wide range of architectures, objectives, and datasets, as illustrated empirically (Figure 1).

Figure 1: Edge of Stability observed in a 3-layer MLP trained on CIFAR-10; both effective curvature and sharpness saturate at $2/\eta$, while training loss exhibits local oscillations.

Classical optimization theory predicts monotonic loss decrease when $\eta < 2/\lambda_{\max}$, with $\lambda_{\max}$ the largest Hessian eigenvalue. Empirically, however, GD trajectories are "forced" to the $2/\eta$ boundary, which dynamical self-regulation near the edge cannot explain by itself. Prior analyses established local stabilizing feedback mechanisms but did not explain global attraction toward the edge from arbitrary initialization.

Edge Coupling Functional: Variational Characterization and Recurrence

This paper introduces the edge coupling functional $\mathcal{A}_\eta(x, y) = L(x) + L(y) - \frac{1}{2\eta}\|x - y\|^2$, whose critical points encode GD as a boundary-value problem. Setting the $x$-gradient to zero yields the exact GD update $y = x - \eta \nabla L(x)$, and setting both partial gradients to zero classifies fixed points and period-two orbits. The edge coupling extends the variational principles of classical mechanics to discrete-time optimization, establishing a rigorous foundation for oscillation phenomena at the EoS.
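The criticality condition can be checked numerically. The following is a minimal sketch, assuming a small hypothetical test loss (`loss` and `grad` are illustrative and not taken from the paper): it takes one GD step from `x` to `y` and confirms by finite differences that the gradient of $\mathcal{A}_\eta$ in its first argument vanishes at that pair.

```python
import numpy as np

# Hypothetical smooth test loss and its gradient (illustrative, not from the paper).
def loss(w):
    return 0.5 * w[0] ** 2 + np.log(np.cosh(w[1])) + 0.25 * (w[0] * w[1]) ** 2

def grad(w):
    return np.array([
        w[0] + 0.5 * w[0] * w[1] ** 2,
        np.tanh(w[1]) + 0.5 * w[0] ** 2 * w[1],
    ])

def edge_coupling(x, y, eta):
    # A_eta(x, y) = L(x) + L(y) - ||x - y||^2 / (2 * eta)
    return loss(x) + loss(y) - np.sum((x - y) ** 2) / (2.0 * eta)

def grad_x_numeric(x, y, eta, h=1e-6):
    # Central finite differences of A_eta with respect to its first argument.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (edge_coupling(x + e, y, eta) - edge_coupling(x - e, y, eta)) / (2 * h)
    return g

eta = 0.3
x = np.array([0.7, -1.2])
y = x - eta * grad(x)                 # one GD step from x

residual = grad_x_numeric(x, y, eta)  # vanishes up to finite-difference error
print(np.max(np.abs(residual)))
```

The residual is zero only up to finite-difference error, since the analytic $x$-gradient $\nabla L(x) - (x - y)/\eta$ cancels exactly when $y$ is the GD successor of $x$.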

Expanding $\mathcal{A}_\eta$ and differencing the criticality conditions across consecutive steps leads to the step recurrence $d_{k+1} = (I - \eta \bar{H}_k) d_k$, where $d_k = w_{k+1} - w_k$ and $\bar{H}_k = \int_0^1 \nabla^2 L(w_k + t d_k)\,dt$ is the step-averaged Hessian. The loss change per step is $L(w_{k+1}) - L(w_k) = -\frac{1}{\eta}\|d_k\|^2 + \frac{1}{2} d_k^\top \tilde{H}_k d_k$, with $\tilde{H}_k = 2\int_0^1 (1-t)\,\nabla^2 L(w_k + t d_k)\,dt$ the triangular average along the step. Summing these loss changes telescopes to a global conservation law, showing the trajectory is compelled to visit steps with effective curvature arbitrarily close to $2/\eta$.
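Both formulas can be verified on a toy problem. The sketch below assumes a hypothetical two-parameter loss with an analytic Hessian (not from the paper), takes the step average with uniform weight and the triangular average with weight $2(1-t)$ on $[0,1]$ (a normalization chosen here so the weights integrate to one), and evaluates both integrals by Gauss-Legendre quadrature.

```python
import numpy as np

# Hypothetical smooth test loss with analytic gradient and Hessian.
def loss(w):
    return 0.5 * w[0] ** 2 + np.log(np.cosh(w[1])) + 0.25 * (w[0] * w[1]) ** 2

def grad(w):
    return np.array([
        w[0] + 0.5 * w[0] * w[1] ** 2,
        np.tanh(w[1]) + 0.5 * w[0] ** 2 * w[1],
    ])

def hess(w):
    sech2 = 1.0 - np.tanh(w[1]) ** 2
    return np.array([
        [1.0 + 0.5 * w[1] ** 2, w[0] * w[1]],
        [w[0] * w[1], sech2 + 0.5 * w[0] ** 2],
    ])

# Gauss-Legendre nodes and weights, mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(20)
t = 0.5 * (nodes + 1.0)
q = 0.5 * weights

eta = 0.3
w0 = np.array([0.7, -1.2])
d0 = -eta * grad(w0)          # step d_0 = w_1 - w_0
w1 = w0 + d0
d1 = -eta * grad(w1)          # step d_1 = w_2 - w_1

# Step-averaged Hessian (uniform weight) and triangular average (weight 2(1 - t)).
H_bar = sum(qi * hess(w0 + ti * d0) for ti, qi in zip(t, q))
H_tri = sum(2.0 * qi * (1.0 - ti) * hess(w0 + ti * d0) for ti, qi in zip(t, q))

recurrence_error = np.max(np.abs(d1 - (np.eye(2) - eta * H_bar) @ d0))
predicted_change = -(d0 @ d0) / eta + 0.5 * d0 @ H_tri @ d0
loss_change_error = abs((loss(w1) - loss(w0)) - predicted_change)
print(recurrence_error, loss_change_error)
```

Both errors sit at the level of quadrature and rounding error, which reflects that the two relations are exact identities for a twice continuously differentiable loss rather than approximations.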

Period-Doubling Bifurcation and Center Reduction

The structure of period-two orbits and the onset of oscillatory instability are analyzed through the bifurcation theory of $\mathcal{A}_\eta$. The center reduction collapses the bifurcation problem into a nonlinear eigenvalue equation parameterized by the half-amplitude alone, in which a reduced function captures the symmetrized loss around the oscillation center. Near the spectral threshold, the quartic expansion of this reduced function determines whether bifurcating branches appear and on which side of the critical learning rate they emerge.
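A one-dimensional toy case makes the reduction concrete. This sketch assumes the loss $L(w) = \log\cosh(w)$, chosen here for illustration and not taken from the paper. For a symmetric orbit $\{+a, -a\}$ around the center $0$, setting both gradients of the edge coupling to zero reduces to the scalar condition $L'(a) = 2a/\eta$ in the half-amplitude $a$; the helper `half_amplitude` is hypothetical and solves it by bisection.

```python
import math

def grad(w):
    # L(w) = log(cosh(w)) has L'(w) = tanh(w) and L''(0) = 1,
    # so the critical learning rate of this toy problem is 2.
    return math.tanh(w)

def gd_step(w, eta):
    return w - eta * grad(w)

def half_amplitude(eta, lo=1e-6, hi=10.0, iters=200):
    # Bisection on f(a) = L'(a) - 2a/eta, the reduced criticality condition
    # for the symmetric orbit {+a, -a}. Returns 0.0 if no nontrivial root exists.
    def f(a):
        return grad(a) - 2.0 * a / eta
    if f(lo) <= 0.0:
        return 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eta = 2.2
a = half_amplitude(eta)
print(a, gd_step(a, eta), gd_step(-a, eta))   # GD maps a to -a and -a back to a
print(half_amplitude(1.8))                    # no nontrivial orbit below the threshold
```

In this example the curvature decreases away from the minimum (the quartic coefficient of $\log\cosh$ is negative), and the orbit exists only for learning rates above the critical value. A loss whose curvature increases away from the minimum would place the branch on the other side.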

For two-layer linear networks, this analysis is width-invariant: the period-doubling branch appears continuously at the critical learning rate irrespective of the hidden dimension (Figure 2), substantiating the theory's universality across parameterizations.

Figure 2: Continuous onset of period-doubling in a two-layer linear network, with amplitude tracking the predicted scaling.

Curvature Forcing and Sharpness Concentration

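The global conservation law imposes stepwise sharpness concentration. For GD with step size $\eta$ on a $C^2$, bounded-below loss, write $d_k = w_{k+1} - w_k$ and let $\kappa_k = 2\int_0^1 (1-t)\, d_k^\top \nabla^2 L(w_k + t d_k)\, d_k \,dt \,/\, \|d_k\|^2$ denote the effective curvature of step $k$. The telescoped identity

$\frac{\sum_{k<K} \kappa_k \|d_k\|^2}{\sum_{k<K} \|d_k\|^2} = \frac{2}{\eta} - \frac{2\,\bigl(L(w_0) - L(w_K)\bigr)}{\sum_{k<K} \|d_k\|^2}$

ensures the trajectory's weighted curvature approaches $2/\eta$ as the cumulative step norm grows. This result does not rely on monotonic loss descent and only requires bounded total loss drop. Step-level concentration bounds are also established, showing that excursions away from $2/\eta$ are of finite measure for any fixed window.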
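The concentration statement can be illustrated on a one-dimensional example. This is a sketch under stated assumptions: the loss $L(w) = \log\cosh(w)$ and the learning rate $2.5$ are chosen here for illustration, and the effective curvature of a step $d$ from $w$ is taken as $\kappa = 2\int_0^1 (1-t)\,L''(w + t d)\,dt$. The code accumulates the step-norm-weighted curvature along a GD run and compares it with $2/\eta$.

```python
import numpy as np

# Hypothetical 1-D loss L(w) = log(cosh(w)) with its first two derivatives.
def loss(w):
    return np.log(np.cosh(w))

def grad(w):
    return np.tanh(w)

def hess(w):
    return 1.0 - np.tanh(w) ** 2

# Gauss-Legendre nodes and weights, mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(40)
t = 0.5 * (nodes + 1.0)
q = 0.5 * weights

def effective_curvature(w, d):
    # kappa = 2 * int_0^1 (1 - t) L''(w + t d) dt  (triangular average along the step)
    return 2.0 * np.sum(q * (1.0 - t) * hess(w + t * d))

eta, w, steps = 2.5, 0.3, 200     # eta exceeds 2 / L''(0), so GD oscillates
w_start = w
weighted_sum = 0.0                # running sum of kappa_k * d_k^2
norm_sum = 0.0                    # running sum of d_k^2
for _ in range(steps):
    d = -eta * grad(w)
    weighted_sum += effective_curvature(w, d) * d ** 2
    norm_sum += d ** 2
    w = w + d

weighted_curvature = weighted_sum / norm_sum
identity_rhs = 2.0 / eta - 2.0 * (loss(w_start) - loss(w)) / norm_sum
print(weighted_curvature, identity_rhs, 2.0 / eta)
```

In this run the loss actually increases from its initial value while the iterates settle into a two-cycle, yet the weighted curvature still approaches $2/\eta$, because the bounded loss change is divided by a cumulative step norm that keeps growing.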

Sharpness forcing is exact: via the mean value theorem, stepwise curvature averages are realized at specific interior points along each GD edge, so the forcing transfers to the true Hessian eigenvalue with no gap (Figure 3).
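The localization step can be sketched as follows. This is an outline under the assumption of a $C^2$ loss, written in notation introduced here ($\varphi_k$, $\rho$, $t_k^\ast$ are not the paper's symbols), and the paper's precise statement may be stronger. A weighted average of the directional curvature along a step is attained at some interior point by the mean value theorem for integrals, and a Rayleigh quotient never exceeds the largest eigenvalue.

```latex
\begin{aligned}
\varphi_k(t) &= \frac{d_k^\top \nabla^2 L(w_k + t\, d_k)\, d_k}{\|d_k\|^2},
  \qquad t \in [0,1], \quad d_k = w_{k+1} - w_k, \\
\kappa_k &= \int_0^1 \rho(t)\, \varphi_k(t)\, dt,
  \qquad \rho \ge 0, \quad \int_0^1 \rho(t)\, dt = 1, \\
\kappa_k &= \varphi_k(t_k^\ast) \quad \text{for some } t_k^\ast \in (0,1), \\
\lambda_{\max}\!\bigl(\nabla^2 L(w_k + t_k^\ast d_k)\bigr) &\ge \kappa_k .
\end{aligned}
```

Here $\rho \equiv 1$ gives the plain step average and $\rho(t) = 2(1-t)$ gives the triangular average, so both averages are realized by the true Hessian at a point of the segment between $w_k$ and $w_{k+1}$.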

Figure 3: Validation of sharpness concentration and loss-change formula; weighted average curvature tightly converges to $2/\eta$ across learning rates.

Oscillatory Stability Mechanisms and Near-Periodicity

Stability at the edge is maintained through two mechanisms:

Growth above the threshold: When curvature exceeds $2/\eta$, step norm grows geometrically, triggering oscillation.
Oscillatory cancellation: Inside the stability window, the near-critical multiplier $1 - \eta\lambda \approx -1$, with $\lambda$ the curvature along the step, causes alternate reversals, with cumulative displacement bounded by variations in step direction.
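Both mechanisms are visible on a one-dimensional quadratic, where the step recurrence reduces to multiplication by $1 - \eta\lambda$. The sketch below assumes the toy loss $L(w) = \lambda w^2/2$ with illustrative values of $\lambda$ and $\eta$; the helper `step_sequence` is hypothetical.

```python
def step_sequence(lam, eta, w0=1.0, n=6):
    # GD on L(w) = lam * w^2 / 2; returns the steps d_k = w_{k+1} - w_k.
    steps, w = [], w0
    for _ in range(n):
        d = -eta * lam * w
        steps.append(d)
        w += d
    return steps

eta = 0.1                                   # threshold 2 / eta = 20
above = step_sequence(lam=22.0, eta=eta)    # curvature above the threshold
below = step_sequence(lam=19.0, eta=eta)    # curvature just inside the window

growth_ratio = above[1] / above[0]          # equals 1 - eta * lam, magnitude above 1
reversal_ratio = below[1] / below[0]        # equals 1 - eta * lam, close to -1
two_step_net = below[0] + below[1]          # consecutive steps nearly cancel
print(growth_ratio, reversal_ratio, two_step_net)
```

Above the threshold each step is a sign-flipped, enlarged copy of the previous one. Just below it the steps still alternate in sign but shrink slowly, so the net displacement over two steps is small compared with a single step.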

In practice, GD exhibits near-periodicity, with two-step returns closely approximating period-two orbits, ensuring directional curvature remains tightly bound to $2/\eta$ (Figure 4).

Figure 4: Two-step return ratio throughout training, confirming approximate step reversal and near-periodicity at the EoS.

Extensions: Mini-Batch SGD and Pairwise Stability

The forcing and concentration theorems extend to mini-batch SGD, where the conservation law remains intact modulo the variance of stochastic gradient noise. The edge coupling also naturally generalizes to algorithmic stability analysis between pairs of GD trajectories, casting stability as a discrete or continuous Kelvin–Voigt equation.

Practical and Theoretical Implications

The variational derivation unifies disparate local and global threads in the analysis of EoS. The results provide sharp theoretical constraints on the spectral dynamics of full-batch GD, offering insight into the universal behaviors underlying training instability and the implicit bias toward flat minima at large learning rates. The reduction to nonlinear eigenproblems and bifurcation theory lays the foundation for compositional analysis across architectures and parameter regimes, enabling exact prediction of phase boundaries in neural dynamics.

Limitations include the smoothness requirements on the loss, which exclude losses with unregularized ReLU activations. The impact on practical loss descent during EoS remains to be characterized. Future directions involve connecting the concentration laws to generalization bounds and extending the framework to analyze algorithmic stability and curvature-aware optimization protocols.

Conclusion

The edge coupling functional $\mathcal{A}_\eta$ resolves the origin of the Edge of Stability in gradient descent. Its criticality conditions yield exact recurrence and conservation laws, forcing the effective stepwise curvature to $2/\eta$ globally, substantiated by mean value localization to true Hessian eigenvalues. The analysis of period-two bifurcations, width invariance, and stability mechanisms establishes a comprehensive theoretical framework for the dynamical and spectral phenomena observed at the EoS. This foundation points to new vistas for algorithmic stability, generalization, and phase analysis in learning dynamics.