
Fast Convergence Rates for Subsampled Natural Gradient Algorithms on Quadratic Model Problems (2508.21022v1)

Published 28 Aug 2025 in cs.LG, math.OC, and stat.ML

Abstract: Subsampled natural gradient descent (SNGD) has shown impressive results for parametric optimization tasks in scientific machine learning, such as neural network wavefunctions and physics-informed neural networks, but it has lacked a theoretical explanation. We address this gap by analyzing the convergence of SNGD and its accelerated variant, SPRING, for idealized parametric optimization problems where the model is linear and the loss function is strongly convex and quadratic. In the special case of a least-squares loss, namely the standard linear least-squares problem, we prove that SNGD is equivalent to a regularized Kaczmarz method while SPRING is equivalent to an accelerated regularized Kaczmarz method. As a result, by leveraging existing analyses we obtain under mild conditions (i) the first fast convergence rate for SNGD, (ii) the first convergence guarantee for SPRING in any setting, and (iii) the first proof that SPRING can accelerate SNGD. In the case of a general strongly convex quadratic loss, we extend the analysis of the regularized Kaczmarz method to obtain a fast convergence rate for SNGD under stronger conditions, providing the first explanation for the effectiveness of SNGD outside of the least-squares setting. Overall, our results illustrate how tools from randomized linear algebra can shed new light on the interplay between subsampling and curvature-aware optimization strategies.

Summary

  • The paper demonstrates that SNGD converges at a rate of $(1-\alpha)^t$, while SPRING, equivalent to an accelerated Kaczmarz method, achieves a rate of $(1-\sqrt{\alpha/\beta})^t$.
  • It interprets the subsampled update steps as randomized projection methods, providing a rigorous theoretical foundation for their efficiency in quadratic problem settings.
  • The study highlights practical implications for hyperparameter tuning, emphasizing the critical role of regularization and momentum in neural network and physics-informed models.

Fast Convergence Rates for Subsampled Natural Gradient Algorithms on Quadratic Model Problems

Introduction and Motivation

This paper provides a rigorous theoretical analysis of subsampled natural gradient descent (SNGD) and its accelerated variant, SPRING, in the context of quadratic model problems. SNGD and SPRING have demonstrated empirical success in scientific machine learning, particularly for neural network wavefunctions (NNWs) and physics-informed neural networks (PINNs), but prior to this work, their fast convergence properties lacked a comprehensive theoretical explanation. The authors address this gap by establishing equivalences between these algorithms and well-studied randomized linear algebra methods, specifically regularized Kaczmarz and its accelerated variants, and by deriving explicit convergence rates under various problem settings.

Problem Formulation and Algorithmic Framework

The analysis focuses on two classes of quadratic model problems:

  • Linear Least-Squares (LLS): $\mathcal{L}(v) = \frac{1}{2}\|v - b\|^2$, with $v_\theta = J\theta$.
  • Linear Least-Quadratics (LLQ): $\mathcal{L}(v) = \frac{1}{2} v^\top H v + v^\top q + c$, with $v_\theta = J\theta$ and $H \succ 0$.
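
For concreteness, substituting the linear model $v_\theta = J\theta$ into each loss gives the parameter-space objectives that SNGD and SPRING actually minimize:

$$\min_{\theta}\ \tfrac{1}{2}\,\|J\theta - b\|^2 \quad \text{(LLS)}, \qquad \min_{\theta}\ \tfrac{1}{2}\,(J\theta)^\top H (J\theta) + (J\theta)^\top q + c \quad \text{(LLQ)}.$$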

These models are representative of the loss landscapes encountered in PINNs and NNWs, respectively. The key insight is that, under these settings, the SNGD and SPRING updates can be interpreted as randomized projection methods in parameter space, with the stochasticity arising from minibatch subsampling.

The SNGD update is given by:

$$\theta_{t+1} = \theta_t - \eta\, J_S^{+} r_S$$

where $J_S$ and $r_S$ are the Jacobian and residual evaluated on a minibatch $S$, and $J_S^{+}$ denotes the regularized pseudoinverse. SPRING introduces a momentum term, updating an auxiliary variable $\phi_t$ alongside $\theta_t$.
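
As a concrete illustration of this update rule, the following NumPy sketch implements a single SNGD step via the regularized pseudoinverse. The function name, default hyperparameters, and the choice of solving a $k \times k$ system are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def sngd_step(theta, J_S, r_S, eta=1.0, lam=1e-4):
    """One SNGD step: theta <- theta - eta * J_S^+ r_S, with the regularized
    pseudoinverse J_S^+ = J_S^T (J_S J_S^T + lam * I)^{-1}.

    theta : (n,) current parameters
    J_S   : (k, n) minibatch Jacobian
    r_S   : (k,) minibatch residual
    """
    k = J_S.shape[0]
    # Work with the small k x k system rather than an n x n normal-equations matrix.
    gram = J_S @ J_S.T + lam * np.eye(k)
    return theta - eta * (J_S.T @ np.linalg.solve(gram, r_S))
```

SPRING replaces this plain residual step with a momentum-style recursion on the auxiliary variable $\phi_t$; its exact update is not reproduced here.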

Theoretical Results: Equivalences and Convergence Rates

SNGD and SPRING as Randomized Kaczmarz Methods

A central contribution is the demonstration that, for LLS, SNGD is equivalent to the regularized Kaczmarz method, and SPRING is equivalent to the Nesterov-accelerated regularized Kaczmarz (ARK) method. This equivalence enables the transfer of convergence results from randomized linear algebra to the analysis of SNGD and SPRING.

  • SNGD (LLS): Converges at a rate $(1-\alpha)^t$, where $\alpha$ is the minimal eigenvalue of the expected projection matrix $\overline{P}$.
  • SPRING (LLS): Achieves an accelerated rate of $(1-\sqrt{\alpha/\beta})^t$, where $\beta$ is related to the spectral properties of the projection operator (illustrated numerically below).
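
To get a feel for the gap between the two rates, the short calculation below compares how many iterations each rate needs to shrink the error by a factor of $10^6$; the values $\alpha = 10^{-3}$ and $\beta = 1$ are arbitrary illustrative choices, not quantities estimated from the paper.

```python
import math

alpha, beta = 1e-3, 1.0   # illustrative values only
target = 1e-6             # desired error reduction factor

# Smallest t with rate**t <= target, i.e. t >= log(target) / log(rate).
t_sngd   = math.ceil(math.log(target) / math.log(1 - alpha))
t_spring = math.ceil(math.log(target) / math.log(1 - math.sqrt(alpha / beta)))

print(f"SNGD,   rate (1 - alpha)^t           : ~{t_sngd} iterations")
print(f"SPRING, rate (1 - sqrt(alpha/beta))^t: ~{t_spring} iterations")
```

With these numbers SNGD needs on the order of $10^4$ iterations while SPRING needs a few hundred, reflecting the square-root-type speedup typical of Nesterov acceleration when $\alpha \ll \beta$.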

This analysis explains the empirical superiority of SNGD over SGD and the further acceleration provided by SPRING, especially for small batch sizes.

Figure 1: SGD, SNGD, and SPRING for two randomly generated difficult instances of the LLQ problem, illustrating the superior convergence of SNGD and SPRING.

Extension to Linear Least-Quadratics

For LLQ, the analysis is more nuanced due to the presence of the function-space Hessian $H$. The authors introduce a strong consistency assumption, ensuring that the function-space gradient remains in the range of the model Jacobian. Under this assumption, they show that SNGD can be viewed as a regularized Kaczmarz method applied to a transformed problem, and they derive the first explicit fast convergence rate for SNGD in this setting.
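
One way to make the transformed problem concrete, offered here as a plausible reading rather than the paper's exact construction: because $H \succ 0$, completing the square rewrites the LLQ loss as a least-squares loss in an $H^{1/2}$-weighted variable,

$$\mathcal{L}(v) = \tfrac{1}{2}\,\bigl\|H^{1/2} v + H^{-1/2} q\bigr\|^2 + \Bigl(c - \tfrac{1}{2}\, q^\top H^{-1} q\Bigr),$$

so that composing with $v_\theta = J\theta$ turns LLQ into an LLS problem with matrix $H^{1/2} J$. The alignment factor $\gamma$ mentioned below can then be read as measuring how well the subsampled projections match this $H$-weighted geometry.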

  • The convergence rate for SNGD in LLQ is shown to be slower than in LLS by a factor $\gamma$ (related to the alignment of the projection and Hessian), but still significantly faster than SGD under realistic spectral decay assumptions for $J$.

Figure 2: Eigenvalues of $M$ for randomly generated problems of two sizes, showing the spectral properties relevant to SNGD convergence.

Figure 3: Effect of the regularization parameter $\lambda$ on the eigenvalues of $M$, demonstrating the necessity of sufficient regularization for convergence.

SPRING for LLQ: Empirical and Conjectural Analysis

While a full theoretical analysis of SPRING for LLQ is left as an open problem, the paper provides empirical evidence that SPRING can dramatically outperform SNGD in this setting. The authors conjecture that the acceleration observed in LLS extends, with some degradation, to LLQ under appropriate conditions.

Figure 4: SGD, SNGD, and SPRING for a randomly generated LLQ instance with varying batch sizes, highlighting the consistent acceleration of SPRING.

Practical Implications and Implementation Considerations

Algorithmic Implementation

The SNGD and SPRING algorithms are readily implementable in modern autodiff frameworks. The key steps are:

  1. Minibatch Sampling: At each iteration, sample a minibatch $S$ of size $k$.
  2. Jacobian and Residual Computation: Compute $J_S$ and $r_S$ via backpropagation and application of the loss operator.
  3. Regularized Pseudoinverse: Compute $J_S^{+} = J_S^\top (J_S J_S^\top + \lambda I)^{-1}$, exploiting the Woodbury identity for efficiency when $k \ll n$.
  4. Parameter Update: Apply the SNGD or SPRING update as specified.

The computational bottleneck is the formation and inversion of the $k \times k$ matrix $J_S J_S^\top + \lambda I$, which is tractable for moderate $k$ even when $n$ is large.
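
Putting the four steps together, the sketch below runs SNGD on a small synthetic, consistent linear least-squares instance. The problem sizes, uniform sampling without replacement, and hyperparameter values are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 1000, 200, 32            # equations, parameters, minibatch size (illustrative)
eta, lam, iters = 1.0, 1e-4, 500

J = rng.standard_normal((m, n)) / np.sqrt(m)    # full Jacobian of the linear model
theta_star = rng.standard_normal(n)
b = J @ theta_star                              # consistent least-squares target

theta = np.zeros(n)
for t in range(iters):
    S = rng.choice(m, size=k, replace=False)    # 1. minibatch sampling
    J_S, r_S = J[S], J[S] @ theta - b[S]        # 2. minibatch Jacobian and residual
    gram = J_S @ J_S.T + lam * np.eye(k)        # 3. k x k regularized system (Woodbury-style)
    theta -= eta * (J_S.T @ np.linalg.solve(gram, r_S))   # 4. SNGD parameter update

print("relative error:", np.linalg.norm(theta - theta_star) / np.linalg.norm(theta_star))
```

Each iteration touches only the $k$ sampled rows, so the dominant cost is forming and solving the small regularized system, consistent with the $O(nk^2)$ per-iteration scaling noted below.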

Hyperparameter Selection

  • Regularization ($\lambda$): Sufficiently large $\lambda$ is necessary for convergence in LLQ, as demonstrated by the spectral analysis of the update operator.
  • Step Size ($\eta$) and Momentum ($\mu$): Should be tuned based on the spectral properties of the problem; the theoretical rates provide guidance for optimal choices.

Resource and Scaling Considerations

  • Memory: Storage of $J_S$ and intermediate matrices scales with $kn$.
  • Computation: Each iteration is $O(nk^2)$, which is efficient for small $k$ and large $n$.

Limitations

  • The analysis assumes strong consistency and, for LLQ, additional spectral alignment conditions that may not always hold in practice.
  • The extension to inconsistent or non-quadratic problems remains open.

Theoretical and Practical Implications

The results provide a rigorous foundation for the observed empirical efficiency of SNGD and SPRING in scientific machine learning. The equivalence to randomized Kaczmarz methods clarifies why using the same minibatch for both the gradient and preconditioner is beneficial, and why acceleration via momentum is effective. The analysis also highlights the importance of regularization and the spectral properties of the data in determining convergence rates.

From a practical standpoint, these insights justify the use of SNGD and SPRING in large-scale PINN and NNW applications, and suggest that further algorithmic improvements may be possible by leveraging advances in randomized linear algebra.

Future Directions

  • SPRING for LLQ: A complete theoretical analysis remains to be developed, particularly to quantify the observed acceleration.
  • Beyond Quadratic Models: Extending the framework to inconsistent, non-quadratic, or non-convex settings is a natural next step.
  • Other Curvature-Aware Methods: The approach may generalize to subsampled Newton or Gauss-Newton methods.
  • Algorithmic Innovation: The connection to optimal row-access methods in randomized linear algebra may inspire new optimizers for high-dimensional scientific ML problems.

Conclusion

This work establishes the first fast convergence rates for SNGD and SPRING in quadratic model problems, providing a theoretical explanation for their empirical success in scientific machine learning. By connecting these algorithms to randomized Kaczmarz methods, the authors offer both practical guidance for implementation and a foundation for future advances in curvature-aware optimization under subsampling. The results have significant implications for the design and analysis of scalable optimization algorithms in high-dimensional, ill-conditioned settings.
