- The paper introduces a variational edge coupling functional that explains how full-batch gradient descent trajectories are forced to maintain an effective curvature of 2/η.
- It derives precise step recurrences and conservation laws that account for the observed oscillatory behavior and period-doubling bifurcations at critical learning rates.
- The analysis extends to mini-batch SGD, offering theoretical insights into stability, curvature forcing, and the universal dynamics underlying neural network training.
Unified Variational Foundation for the Edge of Stability in Full-Batch Gradient Descent
Edge of Stability: Empirical Phenomena and Prior Theory
The Edge of Stability (EoS) characterizes the long-term dynamics of full-batch gradient descent (GD) on neural networks, whereby the spectral norm (sharpness) of the Hessian escalates during training and eventually saturates at the threshold 2/η, with η the learning rate. Beyond this threshold, loss generally oscillates locally while continuing to descend over longer timescales. This phenomenon is robust across a wide range of architectures, objectives, and datasets, as illustrated empirically (Figure 1).

Figure 1: Edge of Stability observed in a 3-layer MLP trained on CIFAR-10; both effective curvature and sharpness saturate at 2/η, while training loss exhibits local oscillations.
Classical optimization theory guarantees monotonic loss decrease when η < 2/λmax, where λmax is the largest Hessian eigenvalue. However, empirical observations show that GD trajectories are "forced" to the 2/η boundary, contradicting the conventional wisdom that dynamical self-regulation alone maintains stability at the edge. Prior analyses established local stabilizing feedback mechanisms but did not explain the global attraction toward the edge from arbitrary initialization.
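The classical threshold can be checked on a one-dimensional quadratic, L(w) = ½λw², where each GD step multiplies w by 1 − ηλ. The sketch below is only an illustration: the function name and the constants are assumptions chosen for this example, not taken from the paper.

```python
import numpy as np

def gd_losses_quadratic(lam, eta, w0=1.0, steps=20):
    """Run GD on L(w) = 0.5 * lam * w**2 and return the loss at every iterate."""
    w = w0
    losses = [0.5 * lam * w**2]
    for _ in range(steps):
        w = w - eta * lam * w          # GD step; equivalent to w *= (1 - eta * lam)
        losses.append(0.5 * lam * w**2)
    return np.array(losses)

lam = 4.0                                     # curvature, so the threshold is 2 / lam = 0.5
below = gd_losses_quadratic(lam, eta=0.45)    # eta < 2 / lam
above = gd_losses_quadratic(lam, eta=0.55)    # eta > 2 / lam

print("monotone decrease below threshold:", bool(np.all(np.diff(below) < 0)))
print("monotone increase above threshold:", bool(np.all(np.diff(above) > 0)))
```

On a pure quadratic the loss simply diverges above the threshold; the EoS phenomenon concerns non-quadratic losses, where the curvature itself changes along the trajectory.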
Edge Coupling Functional: Variational Characterization and Recurrence
This paper introduces the edge coupling functional
$$A_\eta(x, y) = L(x) + L(y) - \frac{1}{2\eta}\,\|x - y\|^2,$$
whose critical points encode GD as a boundary-value problem. Setting the x-gradient to zero yields the exact GD update y = x − η∇L(x), and setting both partial gradients to zero classifies fixed points and period-two orbits. The edge coupling extends the variational principles of classical mechanics to discrete-time optimization, establishing a rigorous foundation for oscillation phenomena at the EoS.
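The link between the functional and the GD update can be checked numerically. The following is a minimal sketch, assuming an illustrative polynomial loss and hypothetical helper names (`loss`, `grad`, `edge_coupling`, `edge_coupling_grads`) that do not come from the paper; it evaluates the partial gradients of the edge coupling at a pair (x, y) in which y is one GD step from x.

```python
import numpy as np

# Illustrative smooth loss (an assumption, not from the paper):
# L(w) = 0.5 * w^T A w + 0.25 * sum(w^4)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def loss(w):
    return 0.5 * w @ A @ w + 0.25 * np.sum(w**4)

def grad(w):
    return A @ w + w**3

def edge_coupling(x, y, eta):
    """A_eta(x, y) = L(x) + L(y) - ||x - y||^2 / (2 * eta)."""
    return loss(x) + loss(y) - np.sum((x - y) ** 2) / (2.0 * eta)

def edge_coupling_grads(x, y, eta):
    """Partial gradients of A_eta with respect to x and y."""
    gx = grad(x) - (x - y) / eta
    gy = grad(y) + (x - y) / eta
    return gx, gy

eta = 0.1
x = np.array([0.7, -0.3])
y = x - eta * grad(x)                  # one GD step from x

gx, gy = edge_coupling_grads(x, y, eta)
print("x-gradient at (x, GD(x)):", gx)  # vanishes up to rounding
print("y-gradient at (x, GD(x)):", gy)  # nonzero unless x is also the GD step from y
```

The y-gradient vanishes as well only when x is in turn the GD step from y, which is the fixed-point or period-two case described above.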
Expanding $A_\eta$ and differencing the criticality conditions across steps leads to the step recurrence
$$d_{k+1} = (I - \eta \bar H_k)\, d_k,$$
where $d_k = w_{k+1} - w_k$ and $\bar H_k$ is the step-averaged Hessian. The loss change per step is
$$L(w_{k+1}) - L(w_k) = \tfrac{1}{2}\, d_k^\top \left( \tilde H_k - \tfrac{2}{\eta} I \right) d_k,$$
with $\tilde H_k$ the triangular average of the Hessian along the step. Summing these loss changes telescopes to a global conservation law, showing the trajectory is compelled to visit steps with effective curvature arbitrarily close to 2/η.
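Both identities follow from Taylor's theorem with integral remainder and can be verified to rounding error. The sketch below makes several assumptions not stated in the text: the step-averaged Hessian is taken as the uniform average of the Hessian over the segment from w_k to w_{k+1}, the triangular average uses the weight 2(1 − t) on t in [0, 1], and the loss is an illustrative polynomial, for which Gauss-Legendre quadrature is exact.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def loss(w):
    return 0.5 * w @ A @ w + 0.25 * np.sum(w**4)

def grad(w):
    return A @ w + w**3

def hess(w):
    return A + np.diag(3.0 * w**2)

# Gauss-Legendre nodes and weights mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(8)
t = 0.5 * (nodes + 1.0)
q = 0.5 * weights

def averaged_hessians(w, d):
    """Return (H_bar, H_tilde) along the segment w + t * d, t in [0, 1]."""
    hs = np.array([hess(w + ti * d) for ti in t])
    h_bar = np.einsum("i,ijk->jk", q, hs)                       # uniform weight
    h_tilde = np.einsum("i,ijk->jk", q * 2.0 * (1.0 - t), hs)   # triangular weight 2(1 - t)
    return h_bar, h_tilde

eta = 0.3
w0 = np.array([0.7, -0.3])
w1 = w0 - eta * grad(w0)
w2 = w1 - eta * grad(w1)
d0, d1 = w1 - w0, w2 - w1

h_bar, h_tilde = averaged_hessians(w0, d0)
recurrence_error = np.linalg.norm(d1 - (np.eye(2) - eta * h_bar) @ d0)
predicted_change = 0.5 * d0 @ (h_tilde - (2.0 / eta) * np.eye(2)) @ d0
loss_change_error = abs((loss(w1) - loss(w0)) - predicted_change)

print("recurrence residual:", recurrence_error)
print("loss-change residual:", loss_change_error)
```

Under these assumptions the loss change is zero exactly when the curvature of the triangular average along the step direction equals 2/η, which is the sense in which 2/η acts as the effective curvature threshold.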
Period-Doubling Bifurcation and Center Reduction
The structure of period-two orbits and the onset of oscillatory instability are analyzed through the bifurcation theory of the edge coupling functional. The center reduction collapses the bifurcation problem into a nonlinear eigenvalue equation parameterized by the half-amplitude of the oscillation and expressed through the symmetrized loss around the oscillation center. Near the spectral threshold, the quartic expansion of this symmetrized loss determines whether bifurcating branches appear and on which side of the critical learning rate they emerge.
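A one-dimensional toy case shows how the quartic term selects the side of the bifurcation. Writing a period-two orbit as c ± a (center c and half-amplitude a, notation assumed here), the two criticality conditions of the edge coupling give ∇L(c + a) = (2/η)a and ∇L(c − a) = −(2/η)a. The sketch below assumes the symmetric loss L(w) = ½λw² + ¼q·w⁴, for which c = 0 and the condition reduces to a² = (2/η − λ)/q; the loss and the function names are illustrative, not the paper's.

```python
import numpy as np

def period_two_half_amplitude(lam, quartic, eta):
    """Half-amplitude a > 0 of the symmetric period-two orbit {+a, -a} of GD on
    L(w) = 0.5 * lam * w**2 + 0.25 * quartic * w**4, or None if no such orbit exists.

    The orbit condition L'(a) = 2 * a / eta gives a**2 = (2 / eta - lam) / quartic.
    """
    a_squared = (2.0 / eta - lam) / quartic
    return float(np.sqrt(a_squared)) if a_squared > 0 else None

def gd_step(w, lam, quartic, eta):
    return w - eta * (lam * w + quartic * w**3)

lam = 1.0                        # curvature at the minimum; critical rate is 2 / lam = 2.0
for quartic in (+0.5, -0.5):
    for eta in (1.9, 2.1):
        a = period_two_half_amplitude(lam, quartic, eta)
        if a is None:
            print(f"quartic={quartic:+.1f}, eta={eta}: no period-two orbit")
        else:
            # One GD step should map +a to -a.
            residual = abs(gd_step(a, lam, quartic, eta) + a)
            print(f"quartic={quartic:+.1f}, eta={eta}: a={a:.4f}, residual={residual:.1e}")
```

In this toy model a positive quartic coefficient puts the branch below the critical learning rate and a negative one puts it above; the negative case is only a local model, since that loss is unbounded below.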
For two-layer linear networks, this analysis is width-invariant: the period-doubling branch appears continuously at the critical learning rate irrespective of the hidden dimension (Figure 2), substantiating the theory's universality across parameterizations.

Figure 2: Continuous onset of period-doubling in a two-layer linear network, with the amplitude tracking the predicted scaling.
Curvature Forcing and Sharpness Concentration
The global conservation law imposes stepwise sharpness concentration. For GD with step size η on a sufficiently smooth loss that is bounded below, the conservation law ensures that the trajectory's weighted curvature approaches 2/η as the cumulative step norm grows. This result does not rely on monotonic loss descent and only requires a bounded total loss drop. Step-level concentration bounds are also established, showing that excursions away from 2/η are of finite measure for any fixed window.
Sharpness forcing is exact: via the mean value theorem, stepwise curvature averages are realized at specific interior points along each GD edge, so the forcing transfers to the true Hessian eigenvalue with no gap (Figure 3).
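The forcing can be illustrated on a one-dimensional toy problem. The sketch below assumes the loss L(w) = log cosh(w), a per-step curvature defined as the average of L'' along the step with triangular weight 2(1 − t), and an average over steps weighted by the squared step norm; these choices and all names are illustrative assumptions rather than the paper's experimental setup. With η = 3 the minimum at 0 is unstable, the iterates settle into an oscillation, and the loss rises slightly rather than descending.

```python
import numpy as np

def grad(w):
    return np.tanh(w)              # L(w) = log(cosh(w))

def hess(w):
    return 1.0 / np.cosh(w) ** 2   # L''(w) = sech(w)^2, at most 1

# Gauss-Legendre nodes and weights mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(32)
t = 0.5 * (nodes + 1.0)
q = 0.5 * weights

def triangular_curvature(w, d):
    """Triangular average of L'' along the step from w to w + d (weight 2(1 - t))."""
    return float(np.sum(q * 2.0 * (1.0 - t) * hess(w + t * d)))

eta = 3.0                          # above 2 / L''(0) = 2, so the minimum at 0 is unstable
w = 0.1
weighted_sum = 0.0                 # running sum of d_k^2 * kappa_k
step_norm_sum = 0.0                # running sum of d_k^2
for _ in range(2000):
    d = -eta * grad(w)
    weighted_sum += d**2 * triangular_curvature(w, d)
    step_norm_sum += d**2
    w = w + d

weighted_curvature = weighted_sum / step_norm_sum
print("weighted average curvature:", weighted_curvature)
print("2 / eta:", 2.0 / eta)
```

The gap between the two printed values equals twice the total loss change divided by the cumulative squared step norm, so it shrinks as the run gets longer while the loss change stays bounded.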

Figure 3: Validation of sharpness concentration and loss-change formula; weighted average curvature tightly converges to 2/η across learning rates.
Oscillatory Stability Mechanisms and Near-Periodicity
Stability at the edge is maintained through two mechanisms:
- Growth above the threshold: When curvature exceeds 2/η, the step norm grows geometrically, triggering oscillation.
- Oscillatory cancellation: Inside the stability window, the near-critical multiplier (close to −1) makes successive steps reverse direction, with cumulative displacement bounded by variations in step direction.
In practice, GD exhibits near-periodicity, with two-step returns closely approximating period-two orbits, ensuring directional curvature remains tightly bound to 2/η (Figure 4).
Figure 4: Two-step return ratio throughout training, confirming approximate step reversal and near-periodicity at the EoS.
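Near-periodicity can be measured directly from the iterates. The text does not define the two-step return ratio, so the sketch below assumes r_k = ‖w_{k+2} − w_k‖ / ‖w_{k+1} − w_k‖, which is zero on an exact period-two orbit; the separable toy loss and the function names are likewise assumptions made for illustration.

```python
import numpy as np

scales = np.array([1.0, 0.2])       # per-coordinate curvature at the minimum

def grad(w):
    # Gradient of the illustrative loss L(w) = sum_i scales[i] * log(cosh(w[i])).
    return scales * np.tanh(w)

def two_step_return_ratios(w0, eta, steps):
    """Return r_k = ||w_{k+2} - w_k|| / ||w_{k+1} - w_k|| along a GD run."""
    iterates = [np.asarray(w0, dtype=float)]
    for _ in range(steps + 1):
        iterates.append(iterates[-1] - eta * grad(iterates[-1]))
    iterates = np.array(iterates)
    two_step = np.linalg.norm(iterates[2:] - iterates[:-2], axis=1)
    one_step = np.linalg.norm(iterates[1:-1] - iterates[:-2], axis=1)
    return two_step / one_step

ratios = two_step_return_ratios(w0=[0.1, 1.0], eta=3.0, steps=60)
print("first ratios:", ratios[:3])
print("last ratios: ", ratios[-3:])
```

With η = 3 the first coordinate (curvature 1 at the minimum) is above the 2/η threshold and settles into an oscillation, while the second (curvature 0.2) contracts to zero, so the ratio decays toward zero as the run approaches a period-two orbit.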
Extensions: Mini-Batch SGD and Pairwise Stability
The forcing and concentration theorems extend to mini-batch SGD, where the conservation law remains intact modulo the variance of stochastic gradient noise. The edge coupling also naturally generalizes to algorithmic stability analysis between pairs of GD trajectories, casting stability as a discrete or continuous Kelvin–Voigt equation.
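One way to see where the noise enters is to repeat the per-step Taylor expansion with a stochastic gradient. The derivation below is a sketch under assumed notation that is not taken from the paper: $g_k$ is the mini-batch gradient, $\xi_k$ its deviation from the full gradient with conditional mean zero, and $\tilde H_k$ the average of the Hessian along the step with triangular weight $2(1-t)$.

```latex
% Assumed notation: g_k = \nabla L(w_k) + \xi_k, \quad d_k = w_{k+1} - w_k = -\eta\, g_k,
% \tilde H_k = \int_0^1 2(1-t)\, \nabla^2 L(w_k + t\, d_k)\, dt.
\begin{align*}
L(w_{k+1}) - L(w_k)
  &= \nabla L(w_k)^\top d_k + \tfrac{1}{2}\, d_k^\top \tilde H_k\, d_k \\
  &= \Bigl(-\tfrac{1}{\eta}\, d_k - \xi_k\Bigr)^{\!\top} d_k
     + \tfrac{1}{2}\, d_k^\top \tilde H_k\, d_k \\
  &= \tfrac{1}{2}\, d_k^\top \Bigl(\tilde H_k - \tfrac{2}{\eta} I\Bigr) d_k
     \;-\; \xi_k^\top d_k , \\[4pt]
\mathbb{E}\bigl[-\xi_k^\top d_k \,\big|\, w_k\bigr]
  &= \eta\, \mathbb{E}\bigl[\xi_k^\top (\nabla L(w_k) + \xi_k) \,\big|\, w_k\bigr]
   = \eta\, \mathbb{E}\bigl[\|\xi_k\|^2 \,\big|\, w_k\bigr].
\end{align*}
```

Under these assumptions the full-batch identity is recovered when the noise vanishes, and the extra term has conditional expectation equal to η times the gradient-noise variance, consistent with a conservation law that holds modulo that variance.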
Practical and Theoretical Implications
The variational derivation unifies disparate local and global threads in the analysis of EoS. The results provide sharp theoretical constraints on the spectral dynamics of full-batch GD, offering insight into the universal behaviors underlying training instability and the implicit bias toward flat minima at large learning rates. The reduction to nonlinear eigenproblems and bifurcation theory lays the foundation for compositional analysis across architectures and parameter regimes, enabling exact prediction of phase boundaries in neural dynamics.
Limitations include smoothness requirements on the loss, which exclude losses with unregularized ReLU activations. The impact on practical loss descent during EoS remains to be characterized. Future directions involve connecting the concentration laws to generalization bounds and extending the framework to analyze algorithmic stability and curvature-aware optimization protocols.
Conclusion
The edge coupling functional $A_\eta$ resolves the origin of the Edge of Stability in gradient descent. Its criticality conditions yield exact recurrence and conservation laws, forcing the effective stepwise curvature to 2/η globally, substantiated by mean value localization to true Hessian eigenvalues. The analysis of period-two bifurcations, width invariance, and stability mechanisms establishes a comprehensive theoretical framework for the dynamical and spectral phenomena observed at the EoS. This foundation points to new vistas for algorithmic stability, generalization, and phase analysis in learning dynamics.