Saddle-to-Saddle Learning Dynamics
- Saddle-to-saddle learning dynamics are a non-convex phenomenon in deep networks where gradient descent moves through successive saddle points before reaching a plateau or global minimum.
- The mechanism unfolds in distinct phases—plateau, escape, and transition—each marked by characteristic Hessian spectral properties and incremental complexity growth.
- Empirical studies and theoretical analyses reveal stagewise progression, high degeneracy at convergence, and invariant manifold effects across various network architectures.
Saddle-to-saddle learning dynamics describes a class of non-convex optimization behavior, characteristic of deep neural networks and high-dimensional energy landscapes, in which a gradient-based trajectory transitions between a succession of saddle points before reaching a final plateau or global minimum. Unlike optimization in low-dimensional settings, where algorithms primarily aim to escape isolated local minima, saddle-to-saddle dynamics reflects the incremental, stagewise acquisition of complexity in non-convex landscapes populated by degenerate and multi-index critical points. This framework unifies observed phenomena such as the prevalence of high-degeneracy saddles at convergence, complexity-growth plateaus during training, and implicit low-rank or sparse bias in learned solutions.
1. Critical Points, Degenerate Saddles, and Definitions
A critical point $\theta^*$ of a loss or energy function $L$ satisfies $\nabla L(\theta^*) = 0$. The eigenvalues of the Hessian $\nabla^2 L(\theta^*)$ classify the critical point:
| Criterion (eigenvalues of $\nabla^2 L(\theta^*)$) | Type | Degeneracy |
|---|---|---|
| All $\lambda_i > 0$ | Local minimum | Nondegenerate |
| All $\lambda_i < 0$ | Local maximum | Nondegenerate |
| Mixed signs in $\{\lambda_i\}$ | Saddle point | Nondegenerate or degenerate |
| At least one $\lambda_i = 0$ | Degenerate saddle | Degenerate (degeneracy $=$ number of zero eigenvalues) |
In high-dimensional deep networks, almost all non-optimal critical points are saddles, and converged solutions are characteristically degenerate: the Hessian at convergence displays many zero eigenvalues. Good saddles are defined as critical points with small gradient norm, a Lipschitz-continuous Hessian, and a limited number of nonzero eigenvalues; symbolically, an $(\epsilon, \rho, k)$-good saddle $\theta$ satisfies $\|\nabla L(\theta)\| \le \epsilon$, has a $\rho$-Lipschitz Hessian, and $\nabla^2 L(\theta)$ has at most $k$ nonzero eigenvalues (Sankar et al., 2017).
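As a minimal illustration of this classification, the sketch below labels a candidate critical point from its Hessian spectrum and applies an illustrative good-saddle check; the function name, the tolerance `eps`, and the bound `k_max` are placeholder choices for this example, not constants from Sankar et al. (2017).

```python
import numpy as np

def classify_critical_point(grad, hess, eps=1e-6, k_max=None):
    """Classify a candidate critical point from its gradient and Hessian.

    Eigenvalues with |lambda| < eps are treated as zero (degenerate directions).
    If k_max is given, also report whether an illustrative "good saddle"
    criterion holds: small gradient norm and at most k_max nonzero eigenvalues.
    """
    eigvals = np.linalg.eigvalsh(hess)          # Hessian is symmetric
    n_pos = int(np.sum(eigvals > eps))
    n_neg = int(np.sum(eigvals < -eps))
    n_zero = len(eigvals) - n_pos - n_neg

    if n_neg == 0 and n_zero == 0:
        kind = "local minimum"
    elif n_pos == 0 and n_zero == 0:
        kind = "local maximum"
    elif n_zero > 0:
        kind = "degenerate critical point (degenerate saddle if signs are mixed)"
    else:
        kind = f"nondegenerate saddle (index {n_neg})"

    good = None
    if k_max is not None:
        good = (np.linalg.norm(grad) <= eps) and (n_pos + n_neg <= k_max)
    return kind, n_zero, good

# Toy example: L(x, y, z) = x^2 - y^2 (z is a flat, degenerate direction).
hess = np.diag([2.0, -2.0, 0.0])
grad = np.zeros(3)
print(classify_critical_point(grad, hess, k_max=2))
```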
2. Mechanics of Saddle-to-Saddle Transition
Saddle-to-saddle dynamics consists of a sequence of phases:
- Plateau phase: The trajectory aligns with a manifold (e.g., low-rank, sparse, or low-width configurations) and slows near a saddle whose Hessian has many zero or near-zero eigenvalues.
- Escape phase: Growth in an unstable direction, often associated with a negative or near-zero Hessian eigenvalue, triggers a transition out of the current saddle's basin.
- Transition and refinement: The trajectory is funneled to the vicinity of a new saddle representing greater model complexity (e.g., higher rank, additional units, less sparsity), and the process repeats.
The escape timescales depend on both geometrical (e.g., spectral gaps, curvature flatness) and initialization effects. In deep ReLU or linear networks with small initialization, empirical and theoretical results show a stagewise process where rank or effective width increases stepwise, each stage associated with a long plateau in training loss (Zhang et al., 23 Dec 2025, Jacot et al., 2021).
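The following minimal sketch reproduces this qualitative picture for a two-layer linear network $f(x) = W_2 W_1 x$ trained by full-batch gradient descent from a small random initialization on a rank-3 target; the dimensions, target singular values, learning rate, and initialization scale are illustrative choices, not those of any cited paper. The printed diagnostics show the loss dropping in stages as the effective rank of the end-to-end map $W_2 W_1$ increases one step at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, r = 20, 20, 3

# Rank-3 target map with well-separated singular values (these gaps set the plateau lengths).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
S = np.zeros(d); S[:r] = [5.0, 2.0, 0.8]
A_star = U @ np.diag(S) @ V.T

# Small initialization places the iterate near the rank-0 saddle at the origin.
scale = 1e-4
W1 = scale * rng.standard_normal((h, d))
W2 = scale * rng.standard_normal((d, h))

lr, steps = 0.01, 1600
for t in range(steps):
    E = W2 @ W1 - A_star                    # residual of the end-to-end map
    loss = 0.5 * np.sum(E**2)
    g2, g1 = E @ W1.T, W2.T @ E             # gradients of the squared Frobenius loss
    W2 -= lr * g2
    W1 -= lr * g1
    if t % 100 == 0:
        sv = np.linalg.svd(W2 @ W1, compute_uv=False)
        eff_rank = int(np.sum(sv > 1e-2))   # crude effective rank of W2 W1
        print(f"step {t:5d}  loss {loss:10.4f}  effective rank {eff_rank}")
```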
3. Rigorous Analyses in Linear and Deep Networks
Saddle-to-saddle learning is rigorously analyzed in several settings:
- Deep Linear Networks: Initialization at small variance places the starting point in the attraction basin of the origin, a rank-0 saddle. Gradient descent escapes this saddle along the top singular direction, moves to a rank-1 saddle, and continues to a sequence of higher-rank saddles, ultimately reaching the global minimum or a minimal-norm interpolant. The sequence is well-approximated by concatenations of gradient flows between embedded low-width solutions. Escape times exhibit power laws, with transitions between stages corresponding to rank increments in the end-to-end map (Jacot et al., 2021, Pesme et al., 2023).
- Deep ReLU Networks: The initial trajectory remains near the origin, a degenerate saddle. The first escape proceeds along a direction where weight matrices in deeper layers become increasingly rank-one, evidenced by the vanishing of the singular-value ratios $s_i/s_1$ for $i \ge 2$, where $s_i$ denotes the $i$-th singular value of a layer's weight matrix. Transition to subsequent saddles is conjectured to increase the bottleneck rank one stage at a time, paralleling incremental feature learning (Bantzis et al., 27 May 2025).
- Hidden Hierarchical Functions (Leap Complexity): For two-layer networks learning hierarchical functions, training sequentially aligns weights to increasingly larger coordinate subsets of the target, incurring plateaus whose durations and frequencies are governed by the leap complexity $\mathrm{Leap}(f)$. The maximal leap encountered controls the training sample and time complexity, scaling as $\tilde{\Theta}\big(d^{\max(\mathrm{Leap}-1,\,1)}\big)$ in the input dimension $d$, and is reflected in empirical saddle-to-saddle plateaus (Abbe et al., 2023); a toy computation of the leap is sketched after this list.
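The brute-force sketch below computes, for a small collection of monomial supports, the minimum over orderings of the largest number of new coordinates introduced at any step. This is a paraphrase of the leap notion used to organize hierarchical targets, not the formal definition from Abbe et al. (2023), and the brute-force search is only feasible for toy examples.

```python
from itertools import permutations

def leap(supports):
    """Smallest achievable 'largest jump' over all orderings of the supports.

    Each support is a set of input coordinates appearing in one monomial of
    the target. The leap of an ordering is the maximum number of coordinates
    seen for the first time at any step; we minimize over orderings.
    """
    best = float("inf")
    for order in permutations(supports):
        seen, worst = set(), 0
        for s in order:
            worst = max(worst, len(set(s) - seen))
            seen |= set(s)
        best = min(best, worst)
    return best

# Staircase-like target x1 + x1*x2 + x1*x2*x3: each monomial adds one new coordinate.
print(leap([{1}, {1, 2}, {1, 2, 3}]))   # -> 1
# Isolated parity x1*x2*x3: all three coordinates must be acquired at once.
print(leap([{1, 2, 3}]))                # -> 3
```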
4. Universal Simplicity Bias and Invariant Manifold Perspective
A key theoretical development is the universality of the saddle-to-saddle mechanism across architectures—fully-connected, convolutional, and attention-based. Each plateau corresponds to the optimizer's trajectory being nearly trapped on an invariant manifold enforcing a certain minimal effective width, rank, number of kernels, or attention heads. The ODE describing parameter evolution preserves these submanifolds until a single unstable direction enables escape to a "wider" or more expressive region (Zhang et al., 23 Dec 2025).
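A simple numerical check of the invariant-manifold picture, for a bias-free two-layer ReLU network: neurons whose incoming and outgoing weights start exactly at zero receive zero gradient (the outgoing gradient involves $\mathrm{ReLU}(0)=0$, and the incoming gradient is multiplied by the zero outgoing weight), so gradient descent stays on the submanifold of networks with a fixed effective width until some perturbation intervenes. The architecture, data, and target below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, width, active = 64, 5, 8, 3

X = rng.standard_normal((n, d))
y = np.maximum(X[:, 0], 0.0) - 0.5 * np.maximum(X[:, 1], 0.0)   # toy target

# Only `active` neurons start nonzero; the rest sit exactly on the zero manifold.
W = np.zeros((width, d)); W[:active] = 0.1 * rng.standard_normal((active, d))
a = np.zeros(width);      a[:active] = 0.1 * rng.standard_normal(active)

lr = 0.05
for t in range(2000):
    pre = X @ W.T                       # (n, width) pre-activations
    h = np.maximum(pre, 0.0)            # ReLU features
    resid = h @ a - y                   # (n,) residuals
    grad_a = h.T @ resid / n
    grad_W = ((resid[:, None] * a) * (pre > 0)).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

# Neurons that started at zero are still exactly zero: an invariant manifold
# of effective width `active` under (sub)gradient descent.
print(np.abs(a[active:]).max(), np.abs(W[active:]).max())       # -> 0.0 0.0
```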
The stepwise growth in solution complexity across stages can be summarized as follows:
| Model class | Manifold constraint | Plateau corresponds to | Stage transition |
|---|---|---|---|
| Linear nets | Rank $\le r$ | All solutions of rank $r$ | $r \to r+1$ |
| ReLU nets | Bottleneck rank $\le k$ | Piecewise-linear maps with $k$ kinks | $k \to k+1$ |
| Conv nets | Kernel number $\le k$ | Solutions with $k$ kernels | $k \to k+1$ |
| Transformer | Attention heads $\le h$ | Solutions using $h$ heads | $h \to h+1$ |
The duration of each plateau is determined by data-dependent spectral gaps (linear case), order statistics of initialization (homogeneous/high-degree models), or subspace geometry, yielding power-law and logarithmic scaling regimes for escape time (Zhang et al., 23 Dec 2025).
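The logarithmic dependence on initialization scale can be read off a one-dimensional linearization around a saddle. Writing $u$ for the component of the parameters along the single unstable direction with curvature $-\lambda < 0$, gradient flow gives, to leading order,

$$
\dot u = \lambda\,u
\quad\Longrightarrow\quad
u(t) = u_0\,e^{\lambda t}
\quad\Longrightarrow\quad
t_{\mathrm{esc}} \approx \frac{1}{\lambda}\,\log\frac{c}{u_0},
$$

where $c$ is the order-one amplitude at which the linearization breaks down. An initialization of scale $u_0 = \delta$ thus yields an escape time growing like $\lambda^{-1}\log(1/\delta)$, and the gaps between successive unstable curvatures set the relative lengths of consecutive plateaus. This is a generic linearized estimate, not a result specific to any one of the cited analyses.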
5. Advanced Algorithmic Realizations and General Solution Landscape Construction
Saddle-to-saddle traversal can be algorithmically exploited in landscape construction:
- High-index Saddle Dynamics (HiSD, iHiSD, NN-HiSD): Dynamic flows that combine gradient descent with reflections along unstable eigendirections enable systematic transitions between index-$k$ saddles (see the toy reflection-flow sketch after this list). Improved methods such as iHiSD allow nonlocal transitions and guarantee that any pair of stationary points can be connected via a sequence of saddle-to-saddle heteroclinic orbits, as formalized by Morse theory (Su et al., 6 Feb 2025, Liu et al., 25 Nov 2024).
- Model-Free and Surrogate-Based Methods: Techniques leveraging Gaussian process surrogates or neural networks can approximate forces or energy landscapes, enabling saddle-to-saddle mapping without explicit knowledge of the underlying function. These approaches yield significant practical gains in sampling efficiency and permit sequential exploration of the global solution landscape (Zhang et al., 2022, Liu et al., 25 Nov 2024).
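A minimal caricature of the reflection idea on a toy two-dimensional potential: the flow reverses the gradient component along the most unstable Hessian eigendirection, so it converges to an index-1 saddle instead of a minimum. The eigendirection is computed exactly here for clarity, rather than with the dimer- or LOBPCG-style approximations used in practical HiSD implementations, and the potential, step size, and initialization are illustrative.

```python
import numpy as np

def grad_hess(p):
    """Gradient and Hessian of the toy double-well potential
    E(x, y) = (x^2 - 1)^2 + y^2, with minima at (+-1, 0)
    and an index-1 saddle at (0, 0)."""
    x, y = p
    g = np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])
    H = np.array([[12.0 * x**2 - 4.0, 0.0],
                  [0.0, 2.0]])
    return g, H

p = np.array([0.6, 0.4])        # start inside the basin of the (1, 0) minimum
dt = 0.05
for _ in range(2000):
    g, H = grad_hess(p)
    w, V = np.linalg.eigh(H)
    v = V[:, 0]                 # eigendirection of smallest curvature
    # Index-1 reflection flow: descend overall, but ascend along v.
    p = p - dt * (g - 2.0 * (v @ g) * v)

print(p)                        # -> approximately [0., 0.], the index-1 saddle
```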
6. Experimental Signatures and Consequences
Comprehensive empirical studies confirm the theoretical structure of saddle-to-saddle dynamics:
- Hessian spectrum at convergence: Large degeneracy and semicircular eigenvalue density centered at zero, with an increasing fraction of zero eigenvalues as network width or depth increases (Sankar et al., 2017).
- Plateaus in loss: Training curves reveal extended periods of little loss reduction, punctuated by abrupt transitions corresponding to escapes from successive saddles. The number and duration of these plateaus depend on data statistics, initialization scale, and architecture (Jacot et al., 2021, Zhang et al., 23 Dec 2025).
- Incremental complexity growth: Both the rank (in linear nets) and BN-rank (in ReLU nets) grow stepwise, reflecting the incremental addition of singular modes or features as training progresses (Jacot et al., 2021, Bantzis et al., 27 May 2025).
7. Theoretical and Practical Implications
Saddle-to-saddle learning dynamics consolidates several threads in modern non-convex learning theory:
- The convergence of optimization in deep models is typically to high-degeneracy, "good enough" saddles—contrary to classical local minimum paradigms.
- Implicit regularization emerges from the structure of these transitions, as gradient descent incrementally seeks simple (low-rank, sparse, minimal-unit) interpolants before traversing to more complex regions only as necessary for data fitting.
- Universal stagewise complexity growth suggests training can be accelerated or modulated by manipulating initialization, stochasticity, or by perturbing near-saddle epochs.
- Morse theory and reflection-based algorithms provide a rigorous framework for constructing and understanding the topology of entire solution landscapes, beyond local optimization.
The saddle-to-saddle paradigm thus informs both the precise mechanics of training deep networks and the geometric underpinnings of their inductive biases, generalization properties, and optimization efficiency across model classes (Sankar et al., 2017, Su et al., 6 Feb 2025, Zhang et al., 23 Dec 2025, Pesme et al., 2023, Jacot et al., 2021, Abbe et al., 2023, Bantzis et al., 27 May 2025, Zhang et al., 2022, Liu et al., 25 Nov 2024).