Curvature-Informed Step Optimization
- A curvature-informed step integrates local curvature information (via Hessians, Jacobians, or proxies) to adapt optimization updates dynamically.
- The approach employs preconditioners and adaptive step sizes, yielding faster convergence rates and enhanced robustness to noise and sharp minima.
- Applications span deep learning, manifold optimization, and scientific computing, often outperforming standard gradient descent in efficiency and stability.
A curvature-informed step is an optimization update rule or algorithmic modification in which explicit local or global curvature information—in the form of Hessians, Jacobians, or curvature proxies—directly modulates the update, step-size, or component selection. Curvature-aware steps are designed to exploit geometric properties of the objective landscape, often yielding improved convergence rates, greater training stability, and enhanced robustness to sharp minima, and are increasingly used across deep learning, manifold optimization, model merging, and scientific computing.
1. Mathematical Formulation of Curvature-Informed Steps
Curvature-informed steps generalize standard first-order updates by preconditioning or rescaling the descent direction according to a curvature proxy. The canonical form is

$$x_{k+1} = x_k - \eta_k\, P_k\, g_k,$$

where $g_k$ is the (possibly stochastic) gradient, $P_k$ is a positive-definite preconditioner encoding curvature, and $\eta_k$ is a learning rate (Pooladzandi et al., 2024). When $P_k = I$, this reduces to classical gradient descent; $P_k = \nabla^2 f(x_k)^{-1}$ recovers Newton's method.
Construction of $P_k$ may use full Hessian information, low-rank or diagonal approximations, or cheap proxies (e.g., empirical Fisher, secant formulae, or second-moment accumulators). In manifold optimization, $P_k$ and step-size constraints derive from geometric smoothness constants involving sectional curvature bounds (Pareth, 26 Feb 2026).
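The following NumPy sketch illustrates the canonical update on a toy quadratic; the test problem, the diagonal inverse-Hessian proxy, and all names are illustrative assumptions rather than an implementation from the cited works.

```python
import numpy as np

# Illustrative quadratic objective f(x) = 0.5 * x^T H x - b^T x (assumed test problem).
H = np.diag([1.0, 10.0, 100.0])          # ill-conditioned Hessian
b = np.array([1.0, 1.0, 1.0])

def grad(x):
    return H @ x - b

def curvature_informed_step(x, eta, P):
    """One update x <- x - eta * P @ g, with P a positive-definite curvature proxy."""
    return x - eta * (P @ grad(x))

# P = I recovers plain gradient descent; a diagonal inverse-Hessian proxy
# rescales each coordinate by its curvature.
P_identity = np.eye(3)
P_diag = np.diag(1.0 / np.diag(H))

x_gd, x_pre = np.zeros(3), np.zeros(3)
for _ in range(50):
    x_gd = curvature_informed_step(x_gd, eta=0.009, P=P_identity)   # limited by 1/lambda_max
    x_pre = curvature_informed_step(x_pre, eta=1.0, P=P_diag)       # curvature-normalized step
print("gradient descent residual:      ", np.linalg.norm(grad(x_gd)))
print("curvature-preconditioned resid.:", np.linalg.norm(grad(x_pre)))
```

Because the diagonal proxy equals the exact inverse Hessian for this separable quadratic, the preconditioned run converges essentially in one step, while plain gradient descent is throttled by the largest eigenvalue.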
2. Curvature-Informed Preconditioners and Update Schemes
Curvature-aware strategies span a wide range, often augmenting first-order methods:
- Lie-group preconditioners: Choose $P = Q^\top Q$ with $Q$ in a connected Lie subgroup (e.g., diagonal, X-shape, low-rank) (Pooladzandi et al., 2024). Matrix-free or low-rank update rules fit the preconditioner online using Hessian-vector products or finite gradient differences. The update minimizes a convex noise-robust criterion without line search or explicit damping.
- Secant or Rayleigh quotient-based modulation: Compute a curvature indicator as $\kappa_k = \frac{s_k^\top y_k}{s_k^\top s_k}$ with $y_k = g_k - g_{k-1}$, where $s_k = x_k - x_{k-1}$ is the latest step. This serves as a local quadratic-model curvature, enabling adaptive gradient "boosting" via a multiplicative gain $1 + \beta_k$ on the update, where $\beta_k$ is a curvature-gated gain (An et al., 16 Apr 2026); a minimal sketch of this pattern appears after this list. Similar techniques appear in scale-invariant Monte Carlo with discrete curvature radii (Madhavan et al., 2021) and PINN optimization (Fonseca et al., 2023).
- Kronecker-factored approximations (KFAC): For structured objectives like PINNs, preconditioners are constructed by blockwise Kronecker-product approximation of the Gauss–Newton or natural-gradient metric, incorporating higher derivatives (e.g., Taylor-mode AD for the Laplacian) (Dangel et al., 2024).
- Curvature-aware sparsification/selection: Model merging frameworks reweight or prune parameter vectors using elementwise second-moment statistics as diagonal curvature proxies, e.g., via a saliency score built from the optimizer's second-moment accumulator $v_i$ (Mahdavinia et al., 14 Sep 2025).
- Manifold step-size control: On Riemannian manifolds, curvature is encoded in geometry-aware smoothness constants, so explicit step-size upper bounds for (stochastic) gradient and Newton-type steps scale inversely with those constants, with a "geometry package" constant bounding parallel transport and curvature distortion, and a transported Jacobian spectral bound entering the Newton-type case (Pareth, 26 Feb 2026).
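A minimal sketch of the secant-based modulation from the second bullet above is given below; the gating thresholds, the linear interpolation of the gain, and all function names are assumptions for illustration, not the exact scheme of (An et al., 16 Apr 2026).

```python
import numpy as np

def secant_curvature(s, y, eps=1e-12):
    """Rayleigh-quotient curvature estimate along the latest step:
    kappa ~ s^T y / s^T s, with s = x_k - x_{k-1} and y = g_k - g_{k-1}."""
    return float(s @ y) / (float(s @ s) + eps)

def curvature_gated_gain(kappa, low=1e-3, high=10.0, boost=0.5):
    """Illustrative gate: boost the gradient in benign low-curvature regions,
    leave it untouched where the secant curvature indicates sharpness."""
    if kappa <= low:
        return 1.0 + boost                     # flat region: bolder step
    if kappa >= high:
        return 1.0                             # sharp region: no boost
    t = (kappa - low) / (high - low)           # interpolate between regimes
    return 1.0 + boost * (1.0 - t)

def modulated_step(x, g, x_prev, g_prev, eta):
    """First-order update whose magnitude is modulated by local curvature."""
    s, y = x - x_prev, g - g_prev
    beta = curvature_gated_gain(secant_curvature(s, y))
    return x - eta * beta * g

# Toy usage on a 1-D quadratic f(x) = 0.5 * a * x^2.
a = 4.0
x_prev, x = np.array([1.0]), np.array([0.9])
x_next = modulated_step(x, a * x, x_prev, a * x_prev, eta=0.1)
```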
3. Theoretical Guarantees and Convergence Properties
Curvature-informed steps yield improved theoretical guarantees in various regimes:
- Linear or near-quadratic convergence in convex/strongly convex regimes: With suitable spectral bounds on the Hessian, curvature-informed PSD preconditioners $P_k$ produce linear convergence; under $\mu$-strong convexity and $L$-smoothness bounds, Newton-like rates are recoverable (Pooladzandi et al., 2024). A worked illustration of the linear-rate mechanism follows this list.
- Explicit sublinear ($\mathcal{O}(1/k)$-type) or geometric rates: Local curvature descent schemes (e.g., LCD1/LCD2) admit explicit convergence rates, replacing the global Lipschitz constant $L$ with local curvature-derived counterparts, immediately tightening worst-case rates (Richtárik et al., 2024).
- Noise robustness and step-size normalization: Online preconditioner fitting or step-size modulation naturally damps stochastic noise, removing the need for additional line-search, clipping, or hand-tuned damping (Pooladzandi et al., 2024, Richtárik et al., 2024, Madhavan et al., 2021).
- Curvature-aware Polyak–Łojasiewicz inequalities: In Riemannian settings, explicit curvature-dependent Polyak–Łojasiewicz bounds ensure linear convergence provided the manifold geometry is controlled (Pareth, 26 Feb 2026).
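As a textbook illustration of the linear-rate mechanism referenced in the first bullet (a standard derivation, not reproduced from the cited works), consider preconditioned gradient descent on a strongly convex quadratic $f(x) = \tfrac{1}{2} x^\top H x$ with $\mu I \preceq H \preceq L I$ and a fixed symmetric positive-definite preconditioner $P$:

$$x_{k+1} = (I - \eta P H)\, x_k \;\;\Longrightarrow\;\; \|x_{k+1}\|_{P^{-1}} \le \rho\, \|x_k\|_{P^{-1}}, \qquad \rho = \max_i \bigl| 1 - \eta\, \lambda_i\bigl(P^{1/2} H P^{1/2}\bigr) \bigr|.$$

If $P \approx H^{-1}$, the preconditioned spectrum clusters near one and $\eta = 1$ gives $\rho \ll 1$ (exactly zero for $P = H^{-1}$); with $P = I$, the best fixed step only achieves the familiar $(\kappa - 1)/(\kappa + 1)$ contraction governed by the condition number $\kappa = L/\mu$.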
4. Algorithmic Instantiations and Pseudocode
Curvature-informed steps are realized by several prominent algorithms:
| Methodology | Preconditioner / Modulation | Key Ingredients |
|---|---|---|
| PSGD (Pooladzandi et al., 2024) | $P = Q^\top Q$, $Q$ in Lie group | Curvature via Hessian-vector/finite diff |
| KFAC for PINNs (Dangel et al., 2024) | Blockwise Kronecker-factored Gauss–Newton | Taylor-mode AD on network for PDE loss |
| OTA+FFG (Mahdavinia et al., 14 Sep 2025) | Diagonal via second-moment accumulator $v_i$ | Adam 2nd moment, Fisher/Hessian proxy |
| CA-AdamW (An et al., 16 Apr 2026) | Rayleigh-quotient gain on secant correction | Secant-based adaptive boost |
| LCD2 (Richtárik et al., 2024) | Step size set by local curvature | Local curvature mapping |
All algorithms use cheap curvature proxies (directional derivatives, accumulated second moments, local models) or structured approximations (diagonal, Kronecker, Lie subgroups) to keep overhead manageable.
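The sketch below illustrates the "cheap diagonal curvature proxy" pattern shared by several rows of the table, using an Adam-style second-moment accumulator to score and prune entries of a merging task vector; the saliency formula (an OBD-style $v_i \tau_i^2$), the keep fraction, and all names are assumptions for illustration rather than the exact OTA+FFG procedure.

```python
import numpy as np

def diagonal_curvature_proxy(grads, beta2=0.999):
    """Adam-style second-moment accumulator v_i over a stream of gradients,
    used as a cheap elementwise (diagonal) curvature proxy."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
    return v

def curvature_aware_prune(task_vector, v, keep_fraction=0.05):
    """Zero out all but the highest-saliency entries, scoring each entry by
    the assumed OBD-style saliency v_i * tau_i^2."""
    saliency = v * task_vector**2
    k = max(1, int(keep_fraction * task_vector.size))
    threshold = np.partition(saliency.ravel(), -k)[-k]
    mask = saliency >= threshold
    return task_vector * mask, mask

# Toy usage with random data standing in for real gradients and task vectors.
rng = np.random.default_rng(0)
grads = [rng.normal(size=1000) for _ in range(100)]
tau = rng.normal(size=1000)
v = diagonal_curvature_proxy(grads)
sparse_tau, mask = curvature_aware_prune(tau, v, keep_fraction=0.05)
print("kept entries:", int(mask.sum()))
```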
5. Practical Considerations and Empirical Findings
Several empirical conclusions are common across curvature-informed methods:
- Robust speedup and stability: PSGD, KFAC, and CA-AdamW all demonstrate faster convergence and better minima in high-dimensional deep learning and PDE benchmarks, with small computational penalty (e.g., PSGD achieves ≈1.2× SGD's per-iteration cost) (Pooladzandi et al., 2024, Dangel et al., 2024, An et al., 16 Apr 2026).
- Noise and hyperparameter resilience: Preconditioner-fitting objectives or curvature-based step-size modulations are robust to stochasticity, requiring less tuning of learning rates or damping terms (Pooladzandi et al., 2024, Mahdavinia et al., 14 Sep 2025, Madhavan et al., 2021).
- Structure-aware parameter selection: Model merging via OTA+FFG leverages shared curvature geometry; pruned experts achieve equivalent or better task performance at densities as low as 1–10% (Mahdavinia et al., 14 Sep 2025). In point cloud downsampling, curvature-informed sampling better preserves sharp features (Bhardwaj et al., 2024).
- Improved generalization and flatness: Curvature-informed optimization tends to find flatter minima, empirically yielding improved generalization across tasks (e.g., vision, NLP, RL) (Pooladzandi et al., 2024, Mahdavinia et al., 14 Sep 2025).
6. Extensions: Manifolds, Geometry, and Curvature-Regulated Dynamics
Curvature-informed steps generalize beyond flat parameter spaces:
- Riemannian manifolds: Optimization on spaces with nontrivial geometric structure (e.g., SO(3), SE(3)) requires all step-size bounds and convergence analyses to explicitly account for sectional curvature, injectivity radius, and parallel transport distortion via a "geometry package" constant. The resulting curvature-aware Sobolev constants define descent lemmas, step bounds, and local quadratic contraction for Newton-type methods (Pareth, 26 Feb 2026); a toy sketch on the unit sphere follows this list.
- Graph and temporal diffusion: In dynamic network models, curvature (e.g., Ollivier-Ricci on graphs) guides information flow. Infection time prediction (R-ODE) selects the next informed node by maximal Ricci curvature, capturing the minimum "transportation effort" in learned embeddings (Sun et al., 2024).
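To make the Riemannian bullet concrete, here is a toy sketch of curvature-aware gradient descent on the unit sphere (constant sectional curvature 1); the eigenvector objective, the conservative curvature-inflated step cap, and all names are illustrative assumptions rather than the constructions of (Pareth, 26 Feb 2026).

```python
import numpy as np

# Toy problem: minimize f(x) = x^T A x over the unit sphere S^2, whose
# minimizer is the eigenvector of A with the smallest eigenvalue.
A = np.diag([3.0, 1.0, 0.5])

def riemannian_grad(x):
    """Project the Euclidean gradient 2Ax onto the tangent space at x."""
    g = 2.0 * A @ x
    return g - (g @ x) * x

def exp_map(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x along v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

# Curvature-aware step cap: stay well inside the injectivity radius (pi on the
# unit sphere) and below 1/L for an assumed smoothness constant L, inflated by
# a margin standing in for the geometry-package constant.
L, geometry_margin = 2.0 * np.max(np.diag(A)), 2.0
eta = min(1.0 / (L * geometry_margin), np.pi / 4)

x = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
for _ in range(200):
    x = exp_map(x, -eta * riemannian_grad(x))
print("approximate minimizer (smallest-eigenvalue direction):", x)
```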
7. Outlook and Significance
The curvature-informed step has emerged as a unifying paradigm bridging optimization theory, machine learning, manifold geometry, and large-scale model maintenance. By systematically incorporating second-order local information—either exactly, in approximated form, or via efficient surrogates—these procedures navigate complex loss surfaces with improved efficiency and robustness.
Notable patterns include the convergence of ideas from disparate communities: numerical optimization, geometric learning, post-hoc model merging, and scientific computing. A plausible implication is that future large-scale and scientific ML systems will increasingly rely on lightweight, curvature-aware primitives for both computational tractability and reliability in challenging, high-dimensional, and geometrically structured settings.
Relevant references include (Pooladzandi et al., 2024, Mahdavinia et al., 14 Sep 2025, Richtárik et al., 2024, An et al., 16 Apr 2026, Madhavan et al., 2021, Pareth, 26 Feb 2026, Bhardwaj et al., 2024, Dangel et al., 2024, Fonseca et al., 2023, Sun et al., 2024).