Rayleigh-Gauss-Newton: VMC Optimization

Updated 4 July 2026

Rayleigh-Gauss-Newton (RGN) is a variational Monte Carlo optimizer that minimizes the Rayleigh quotient using a Gauss-Newton style approximation of the local energy curvature.
It constructs a quadratic model relying on first derivatives and a regularized Fubini-Study metric to efficiently capture curvature without full second-order derivatives.
Theory predicts superlinear convergence in a noiseless local regime, and empirical studies on spin models show robust performance, especially when RGN is paired with enhanced sampling such as parallel tempering.

Rayleigh-Gauss-Newton (RGN) is a second-order-style optimization method introduced for variational Monte Carlo (VMC) training of parameterized wavefunctions, especially neural-network wavefunctions. It is formulated for minimizing the Rayleigh quotient

$\mathcal{E}[\psi] = \frac{\langle \psi,\mathcal H\psi\rangle}{\langle \psi,\psi\rangle},$

and is motivated by the observation that, near an eigenstate, the local curvature of the VMC energy is dominated by a first-derivative-based Hessian-like term. In that regime, RGN replaces a full second-order treatment by a Gauss-Newton-style approximation that uses first derivatives of the wavefunction together with a metric regularization, yielding a linear-system update that is theoretically superlinear in a noiseless local regime and practically effective when combined with enhanced sampling such as parallel tempering (Webber et al., 2021).

1. Rayleigh quotient structure and the origin of the method

The defining optimization objective in VMC is the Rayleigh quotient of the Hamiltonian over a parameterized family $\psi=\psi_{\boldsymbol\theta}$. The “Rayleigh” component of the name refers to this energy functional, not to classical Rayleigh quotient iteration for eigenvalue problems. The method is developed by analyzing the local expansion of the normalized wavefunction $\widehat\psi_{\boldsymbol\theta+\boldsymbol\delta}$ and its induced energy variation (Webber et al., 2021).
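As a concrete illustration of the objective, the sketch below evaluates the Rayleigh quotient exactly on a three-spin transverse-field Ising chain. The helper names `tfi_hamiltonian` and `rayleigh_quotient`, the open boundary condition, and the couplings `J = h = 1` are choices made here for illustration and are not taken from the cited paper.

```python
import numpy as np

def tfi_hamiltonian(n, J=1.0, h=1.0):
    """Dense open-chain transverse-field Ising matrix in the Z basis:
    H = -J * sum_i Z_i Z_{i+1} - h * sum_i X_i."""
    dim = 2**n
    idx = np.arange(dim)
    bits = (idx[:, None] >> np.arange(n)[None, :]) & 1
    spins = 1 - 2 * bits                       # +1 / -1 value of each spin
    diag = -J * np.sum(spins[:, :-1] * spins[:, 1:], axis=1)
    H = np.diag(diag.astype(float))
    for i in range(n):
        H[idx, idx ^ (1 << i)] -= h            # X_i flips spin i
    return H

def rayleigh_quotient(H, psi):
    """E[psi] = <psi, H psi> / <psi, psi> for an unnormalized vector psi."""
    return np.vdot(psi, H @ psi).real / np.vdot(psi, psi).real

H = tfi_hamiltonian(3)
evals, evecs = np.linalg.eigh(H)
rng = np.random.default_rng(0)
e_trial = rayleigh_quotient(H, rng.normal(size=H.shape[0]))
e_ground = rayleigh_quotient(H, 5.0 * evecs[:, 0])   # rescaling psi leaves E unchanged
```

Any trial vector gives an energy at or above the lowest eigenvalue, and the quotient is invariant under rescaling of the wavefunction, which is why VMC can work with unnormalized amplitudes.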

With intermediate normalization and a Taylor expansion, the local energy difference is written as

$\mathcal{E}[\widehat\psi_{\boldsymbol\theta+\boldsymbol\delta}] - \mathcal{E}[\widehat\psi_{\boldsymbol\theta}] = \boldsymbol\delta^\ast \boldsymbol g + \boldsymbol g^\ast \boldsymbol\delta + \boldsymbol\delta^\ast \boldsymbol H\,\boldsymbol\delta + \Re\!\left(\boldsymbol\delta^T \overline{\boldsymbol J\,\boldsymbol\delta}\right) + \mathcal O(|\boldsymbol\delta|^3).$

Here $\boldsymbol g$ is the energy gradient term, $\boldsymbol H$ is a Hessian-like matrix constructed from first derivatives of the wavefunction, and $\boldsymbol J$ contains second derivatives of the wavefunction. The central theoretical observation is that $\boldsymbol J\to 0$ as the wavefunction approaches an eigenstate, under mild boundedness assumptions. This makes $\boldsymbol H$ the dominant local curvature object near the solution and provides the specific justification for a Gauss-Newton-style approximation in the VMC setting (Webber et al., 2021).

The method therefore occupies an intermediate position between first-order VMC optimizers and full second-order approaches. It uses more local curvature information than gradient descent or natural gradient descent, but avoids the full second-derivative burden associated with an exact Hessian.
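The vanishing of the second-derivative term can be checked directly in a simplified setting. As a sketch, assume a real wavefunction, real parameters, and a real symmetric Hamiltonian (assumptions made here for brevity; the complex case is not treated), and write $N=\langle\psi,\psi\rangle$ and $E=\mathcal E[\psi]$. Differentiating the Rayleigh quotient once and then twice gives

$\partial_i E = \frac{2}{N}\,\langle \partial_i\psi,(\mathcal H-E)\psi\rangle,$

$\partial_i\partial_j E = \frac{2}{N}\,\langle \partial_i\partial_j\psi,(\mathcal H-E)\psi\rangle + \frac{2}{N}\,\langle \partial_i\psi,(\mathcal H-E)\,\partial_j\psi\rangle - \frac{2}{N}\left(\partial_i E\,\langle \partial_j\psi,\psi\rangle + \partial_j E\,\langle \partial_i\psi,\psi\rangle\right).$

Only the first term of the second derivative contains second derivatives of $\psi$, and it is paired with the residual $(\mathcal H-E)\psi$. At an eigenstate the residual is zero, so that term vanishes identically, while the remaining terms are built from first derivatives of the wavefunction alone.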

2. Update rule, quadratic model, and relation to other VMC optimizers

RGN is expressed in a unified preconditioned-gradient form,

$\boldsymbol P^i(\boldsymbol\theta^{i+1}-\boldsymbol\theta^i)=-\boldsymbol g(\boldsymbol\theta^i).$

The specific RGN step is obtained by minimizing the regularized quadratic model

$\min_{\boldsymbol\delta}\;\left[\boldsymbol\delta^\ast \boldsymbol g + \boldsymbol g^\ast \boldsymbol\delta + \boldsymbol\delta^\ast \boldsymbol H\,\boldsymbol\delta + \epsilon^{-1}\,\boldsymbol\delta^\ast(\boldsymbol S+\eta\boldsymbol I)\,\boldsymbol\delta\right],$

which yields

$\left(\boldsymbol H + \epsilon^{-1}(\boldsymbol S+\eta\boldsymbol I)\right)(\boldsymbol\theta^{i+1}-\boldsymbol\theta^i) = -\boldsymbol g(\boldsymbol\theta^i).$

Here $\boldsymbol S$ is the Fubini-Study metric / quantum information metric,

$\boldsymbol S_{jk} = \frac{\langle \partial_{\theta_j}\psi,\partial_{\theta_k}\psi\rangle}{\langle\psi,\psi\rangle} - \frac{\langle \partial_{\theta_j}\psi,\psi\rangle\,\langle\psi,\partial_{\theta_k}\psi\rangle}{\langle\psi,\psi\rangle^2}.$

The role of $\epsilon$ is that of a penalty or regularization parameter, and $\eta$ adds metric regularization (Webber et al., 2021).

The derivation is explicitly Gauss-Newton in spirit. The quadratic approximation retained by RGN is

$\mathcal{E}[\widehat\psi_{\boldsymbol\theta+\boldsymbol\delta}] - \mathcal{E}[\widehat\psi_{\boldsymbol\theta}] \approx \boldsymbol\delta^\ast \boldsymbol g + \boldsymbol g^\ast \boldsymbol\delta + \boldsymbol\delta^\ast \boldsymbol H\,\boldsymbol\delta,$

with the second-derivative term involving $\boldsymbol J$ intentionally omitted because computing all second derivatives of the wavefunction is expensive for large parameter counts. The approximation is thus not an arbitrary truncation; it is matched to the local eigenstate geometry established in the theory.

The comparison to other VMC optimizers is structurally clear.

Method	Preconditioner $\boldsymbol P$	Characteristic feature
Gradient descent	$\epsilon^{-1}\boldsymbol I$	Euclidean step
Natural gradient descent	$\epsilon^{-1}(\boldsymbol S+\eta\boldsymbol I)$	Metric-preconditioned step
Rayleigh-Gauss-Newton	$\boldsymbol H+\epsilon^{-1}(\boldsymbol S+\eta\boldsymbol I)$	Curvature plus metric

Natural gradient descent uses only the geometry of the parameterization through $\boldsymbol S$, whereas RGN additionally incorporates the local energy curvature through $\boldsymbol H$. The linear method, by contrast, minimizes a Rayleigh quotient over the linearized wavefunction $\psi_{\boldsymbol\theta}+\sum_j \delta_j\,\partial_{\theta_j}\psi_{\boldsymbol\theta}$, which leads to a generalized eigenvalue problem. RGN instead solves a regularized linear system, and the two agree up to higher-order terms (Webber et al., 2021).
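A noiseless toy version of the update can make the structure concrete. The sketch below assumes a preconditioner of the form $\boldsymbol H+\epsilon^{-1}(\boldsymbol S+\eta\boldsymbol I)$, a real positive log-linear (Jastrow-like) ansatz $\log\psi_{\boldsymbol\theta}(\sigma)=\sum_k\theta_k f_k(\sigma)$ with pair-product features, and exact summation over a four-spin transverse-field Ising chain. The ansatz, the function names, and the halving of `eps` when a step fails to lower the energy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tfi_hamiltonian(n, J=1.0, h=1.0):
    """Open-chain TFI matrix and the table of spin values per basis state."""
    dim = 2**n
    idx = np.arange(dim)
    spins = 1 - 2 * ((idx[:, None] >> np.arange(n)[None, :]) & 1)
    diag = -J * np.sum(spins[:, :-1] * spins[:, 1:], axis=1)
    H = np.diag(diag.astype(float))
    for i in range(n):
        H[idx, idx ^ (1 << i)] -= h
    return H, spins

def rgn_quantities(theta, H, F):
    """Exact energy E, gradient term g, metric S, and Hessian-like matrix Hm."""
    logpsi = F @ theta
    psi = np.exp(logpsi - logpsi.max())
    psi /= np.linalg.norm(psi)                   # <psi, psi> = 1
    E = psi @ H @ psi
    # Parameter derivatives of psi, projected orthogonally to psi.
    T = (F - (psi**2) @ F) * psi[:, None]
    g = T.T @ (H @ psi - E * psi)
    S = T.T @ T
    Hm = T.T @ H @ T - E * S                     # first derivatives only
    return E, g, S, Hm

def rgn_step(theta, H, F, eps=1.0, eta=1e-3, max_halvings=30):
    """Solve (Hm + (S + eta I)/eps) delta = -g; halve eps until E does not rise."""
    E, g, S, Hm = rgn_quantities(theta, H, F)
    R = S + eta * np.eye(len(theta))
    for _ in range(max_halvings):
        try:
            delta = np.linalg.solve(Hm + R / eps, -g)
            E_new = rgn_quantities(theta + delta, H, F)[0]
            if E_new <= E:
                return theta + delta, E_new
        except np.linalg.LinAlgError:
            pass
        eps *= 0.5                               # assumed safeguard: shorter step
    return theta, E

n = 4
H, spins = tfi_hamiltonian(n)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
F = np.stack([spins[:, i] * spins[:, j] for i, j in pairs], axis=1).astype(float)
theta = np.zeros(F.shape[1])
E_exact = np.linalg.eigvalsh(H)[0]
energies = [rgn_quantities(theta, H, F)[0]]
for _ in range(20):
    theta, E = rgn_step(theta, H, F)
    energies.append(E)
```

Setting `Hm` to zero in `rgn_step` would reduce the same code to a natural-gradient-style update, which is one way to see that the two methods differ only in the curvature term. The final energy stays above `E_exact` because this small ansatz cannot represent the ground state exactly.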

3. Local convergence theory and the superlinear regime

The convergence analysis of RGN is local and is developed first in a noiseless setting. The relevant generic update class is

$\boldsymbol\theta^{i+1}=\boldsymbol\theta^i-(\boldsymbol P^i)^{-1}\boldsymbol g(\boldsymbol\theta^i),$

with $\boldsymbol P^i$ positive definite. The main theorem bounds the asymptotic energy reduction rate by how accurately $\boldsymbol P^i$ approximates the true local curvature (Webber et al., 2021). In the complex-valued case, the corresponding curvature object is the block Hessian / Wirtinger Hessian.

For RGN, the key asymptotic mechanism is that near an eigenstate the troublesome second-derivative contribution $\boldsymbol J$ vanishes, so a preconditioner built from $\boldsymbol H$ can approach the true local Hessian. Under the conditions emphasized in the paper (convergence of the iterates to a local minimizer $\boldsymbol\theta^\ast$, positive definiteness of the Hessian or Wirtinger Hessian at the minimizer, local analyticity of the wavefunction in the parameters, and a penalty schedule such that the effective RGN preconditioner approaches the local curvature), the asymptotic energy ratio satisfies

$\lim_{i\to\infty}\frac{\mathcal E[\psi_{\boldsymbol\theta^{i+1}}]-\mathcal E[\psi_{\boldsymbol\theta^\ast}]}{\mathcal E[\psi_{\boldsymbol\theta^{i}}]-\mathcal E[\psi_{\boldsymbol\theta^\ast}]}=0.$

This is the paper’s superlinear convergence statement (Webber et al., 2021).

A common misconception is to treat this as a global guarantee. The analysis is explicitly local. The favorable rate depends on the iterates already entering a regime where the eigenstate-based curvature simplification is accurate and where the regularized preconditioner tracks the true Hessian sufficiently well.
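The link between preconditioner accuracy and convergence rate can be seen on a toy quadratic, which is a stand-in chosen here and not the paper's theorem. For $f(x)=\tfrac12 x^T A x$ and the update $x \leftarrow x - P^{-1}\nabla f(x)$, the energy ratio obeys $f(x^{+})/f(x)\le\|I-A^{1/2}P^{-1}A^{1/2}\|^2$. The sketch uses a diagonal $A$ and $P=A+\lambda I$, so the bound is $(\lambda/(a_{\min}+\lambda))^2$; the function name `energy_ratios` and the schedule $\lambda_i=2^{-i}$ are assumptions for illustration.

```python
import numpy as np

def energy_ratios(a, lambdas, x0):
    """Iterate x <- x - P^{-1} grad f for f(x) = 0.5 * sum(a * x**2),
    with P = diag(a) + lam * I, and return f(x_{i+1}) / f(x_i) per step."""
    x = np.array(x0, dtype=float)
    f = 0.5 * np.sum(a * x**2)
    ratios = []
    for lam in lambdas:
        x = x - (a * x) / (a + lam)
        f_new = 0.5 * np.sum(a * x**2)
        ratios.append(f_new / f)
        f = f_new
    return np.array(ratios)

a = np.array([1.0, 10.0])                    # exact curvature (Hessian diagonal)
lams = 0.5 ** np.arange(10)                  # regularization that fades away
fixed = energy_ratios(a, [1.0] * 10, [1.0, 1.0])   # P stays inexact
fading = energy_ratios(a, lams, [1.0, 1.0])        # P approaches the Hessian
bound = (lams / (a.min() + lams)) ** 2
```

With a fixed mismatch the ratio settles at a constant (0.25 here), which is linear convergence. When the mismatch fades, the ratio is driven toward zero, which is the superlinear behavior described above.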

4. Sampling noise, vanishing variance, and enhanced sampling

In practical VMC, the quantities entering RGN are estimated stochastically from samples $\sigma_1,\sigma_2,\ldots$ drawn from the density $\rho\propto|\psi_{\boldsymbol\theta}|^2$. The paper introduces estimators based on the local energy

$E_L(\sigma)=\frac{(\mathcal H\psi_{\boldsymbol\theta})(\sigma)}{\psi_{\boldsymbol\theta}(\sigma)}$

and logarithmic derivatives

$\partial_{\theta_j}\log\psi_{\boldsymbol\theta}(\sigma).$

It then proves a vanishing-variance principle: under geometric ergodicity and mild moment assumptions, the asymptotic variances of the energy and gradient estimators vanish when the local energy becomes constant, that is, at an eigenstate (Webber et al., 2021).
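The core of the vanishing-variance principle can be checked by exact enumeration on a small system, used here in place of MCMC. The sketch computes the mean and variance of the local energy under $\rho\propto|\psi|^2$ for the exact ground state of a four-spin transverse-field Ising chain and for a perturbed trial state. The function names and the uniform shift of `0.2` are illustrative assumptions.

```python
import numpy as np

def tfi_hamiltonian(n, J=1.0, h=1.0):
    """Dense open-chain transverse-field Ising matrix in the Z basis."""
    dim = 2**n
    idx = np.arange(dim)
    spins = 1 - 2 * ((idx[:, None] >> np.arange(n)[None, :]) & 1)
    diag = -J * np.sum(spins[:, :-1] * spins[:, 1:], axis=1)
    H = np.diag(diag.astype(float))
    for i in range(n):
        H[idx, idx ^ (1 << i)] -= h
    return H

def local_energy_stats(H, psi):
    """Mean and variance of E_L(s) = (H psi)(s) / psi(s) under rho = psi^2 / <psi, psi>."""
    rho = psi**2 / np.sum(psi**2)
    e_loc = (H @ psi) / psi
    mean = np.sum(rho * e_loc)
    var = np.sum(rho * (e_loc - mean) ** 2)
    return mean, var

H = tfi_hamiltonian(4)
evals, evecs = np.linalg.eigh(H)
# Fix the overall sign; the ground state of this model then has positive entries.
ground = evecs[:, 0] * np.sign(evecs[:, 0].sum())
trial = ground + 0.2                          # positive amplitudes, not an eigenstate
mean_g, var_g = local_energy_stats(H, ground)
mean_t, var_t = local_energy_stats(H, trial)
```

For the eigenstate the local energy is the same at every configuration, so `var_g` is zero up to round-off, whereas `var_t` is strictly positive. With independent samples the variance of the energy estimator would be this quantity divided by the sample count.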

Away from an eigenstate, however, the stochastic estimators can have substantial variance. This matters acutely for RGN because curvature-based methods are more sensitive than purely first-order methods to sampling error in the gradient $\boldsymbol g$, the metric $\boldsymbol S$, and especially the curvature matrix $\boldsymbol H$. The practical implementation therefore uses stabilization heuristics, including shrinking the step when an update becomes too large.

The principal sampling innovation paired with RGN is parallel tempering. The method introduces a ladder of intermediate distributions at different temperatures, with swap moves between adjacent temperatures. The purpose is to prevent Markov chains from remaining trapped in metastable regions of configuration space. The paper reports that standard MCMC can generate large energy spikes when chains move between such regions, after which optimization may require on the order of […] steps to recover. Parallel tempering reduces both the size and duration of these spikes and makes RGN robust to them (Webber et al., 2021).

This pairing of optimizer and sampler is central to the method as actually used. In the cited work, RGN is not presented as a curvature update in isolation, but as part of an optimization-and-sampling scheme for stochastic Rayleigh quotient minimization.
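The swap mechanism can be sketched on a one-dimensional toy target. The example assumes a ladder of the form $\rho_k\propto p^{\beta_k}$, where $p$ plays the role of $|\psi_{\boldsymbol\theta}|^2$ and $\beta_1=1$ is the physical rung. This power-law ladder, the bimodal target, and the function names are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def swap_log_ratio(logp_a, logp_b, beta_a, beta_b):
    """Log acceptance ratio for exchanging the states of two replicas with
    targets p**beta_a and p**beta_b (logp_* is log p at each replica's state)."""
    return (beta_a - beta_b) * (logp_b - logp_a)

def parallel_tempering(logp, betas, n_sweeps, rng):
    """Nearest-neighbour Metropolis moves on a periodic 1D grid plus swaps
    between adjacent rungs. Returns the trajectory of the betas[0] replica."""
    n_states = len(logp)
    states = rng.integers(n_states, size=len(betas))
    trace = np.empty(n_sweeps, dtype=int)
    for t in range(n_sweeps):
        # Local update of every replica at its own inverse temperature.
        for k, beta in enumerate(betas):
            prop = (states[k] + rng.choice([-1, 1])) % n_states
            if np.log(rng.random()) < beta * (logp[prop] - logp[states[k]]):
                states[k] = prop
        # Swap attempts between adjacent temperatures.
        for k in range(len(betas) - 1):
            lr = swap_log_ratio(logp[states[k]], logp[states[k + 1]],
                                betas[k], betas[k + 1])
            if np.log(rng.random()) < lr:
                states[k], states[k + 1] = states[k + 1], states[k]
        trace[t] = states[0]
    return trace

x = np.arange(40)
# Two well-separated modes, mimicking metastable regions of configuration space.
logp = np.logaddexp(-0.5 * ((x - 8) / 1.5) ** 2, -0.5 * ((x - 30) / 1.5) ** 2)
betas = [1.0, 0.5, 0.25, 0.1]
trace = parallel_tempering(logp, betas, n_sweeps=2000, rng=np.random.default_rng(0))
```

Replicas with small $\beta$ see a flattened landscape and cross between modes easily, and accepted swaps pass those crossings down to the $\beta=1$ replica, whose samples are the ones used for estimation.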

5. Empirical behavior on transverse-field Ising and XXZ models

The experimental study uses restricted Boltzmann machine wavefunctions on transverse-field Ising (TFI) and XXZ spin models, including large one-dimensional and two-dimensional lattices. In a noiseless […] TFI benchmark, where energies and derivatives are computed exactly by summation, RGN converges faster and to lower error than gradient descent, natural gradient descent, and the linear method, and the Hessian approximation used by RGN has relative errors below […] for most iterations (Webber et al., 2021).

For large systems, the reported relative-error comparisons after 1000 iterations show consistent advantages for RGN.

System	Natural GD (relative error)	RGN (relative error)
[…] TFI	[…], […], […]	[…], […], […]
[…] XXZ	[…], […], […]	[…], […], […]

For the […] TFI model, the paper states that RGN improves accuracy by up to four orders of magnitude relative to natural gradient descent. On the […] TFI model, RGN reaches roughly 4–6 significant digits after 200 iterations, while natural gradient descent is much less accurate at the same iteration count; RGN after 200 iterations often matches or exceeds natural gradient descent after 1000 iterations (Webber et al., 2021).

The per-iteration runtime of RGN is higher than that of natural gradient descent but remains comparable. Reported runtimes per 1000 optimization steps on a 48-core CPU node are 18–21 hours for RGN versus 12–14 hours for natural gradient descent on […] TFI, 58–63 hours versus 30–32 hours on […] TFI, and 97–100 hours versus 85–90 hours on […] XXZ. The same work argues theoretically that RGN can improve upon gradient descent and natural gradient descent to achieve superlinear convergence at no more than twice the computational cost, and experimentally reports RGN to be less than a factor of two more expensive than natural gradient descent while often achieving substantially better accuracy in far fewer iterations (Webber et al., 2021).
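The cost comparison can be made explicit with a little arithmetic on the runtime ranges quoted above. The sketch takes midpoints of each range, which is a simplification introduced here, and the system labels are placeholders rather than the paper's lattice sizes.

```python
# Reported hours per 1000 optimization steps, as (low, high) ranges.
# Labels are placeholders for the three systems in the order quoted above.
runtimes = {
    "TFI (first)":  {"rgn": (18, 21),  "ngd": (12, 14)},
    "TFI (second)": {"rgn": (58, 63),  "ngd": (30, 32)},
    "XXZ":          {"rgn": (97, 100), "ngd": (85, 90)},
}

def midpoint(rng_pair):
    """Midpoint of a (low, high) range."""
    return 0.5 * (rng_pair[0] + rng_pair[1])

# Per-step cost of RGN relative to natural gradient descent.
cost_ratio = {name: midpoint(t["rgn"]) / midpoint(t["ngd"])
              for name, t in runtimes.items()}

# Cost of 200 RGN steps relative to 1000 natural-gradient steps.
budget_ratio = {name: 0.2 * r for name, r in cost_ratio.items()}

for name in runtimes:
    print(f"{name}: per-step ratio {cost_ratio[name]:.2f}, "
          f"200 RGN vs 1000 NGD steps {budget_ratio[name]:.2f}")
```

On these midpoints the per-step ratio stays below two for all three systems, and 200 RGN steps cost well under half of 1000 natural-gradient steps, which is the regime in which the accuracy comparison above is made.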

6. Terminological scope and relation to adjacent methods

In the cited literature, “Rayleigh-Gauss-Newton” denotes a specific VMC optimizer rather than a generic umbrella term. Several neighboring methods combine Rayleigh-type structure or Gauss-Newton structure, but they are conceptually distinct.

First, generalized Rayleigh Quotient Iteration (RQI) has been formulated as a Newton/Gauss-Newton method on a constrained manifold, where a generalized Rayleigh quotient eliminates multiplier variables and the increment is computed either in Schur form or in Newton form. That framework covers linear and nonlinear eigenvalue problems, constrained optimization, and tensor eigenpair problems, and establishes cubic convergence through a constrained Chebyshev term related to a second covariant derivative (Nguyen, 2019). The commonality with RGN lies in the coexistence of Rayleigh-type elimination and Newton/Gauss-Newton reasoning, but the algorithmic setting is different: RQI is a constrained manifold iteration for nonlinear equations or eigenproblems, whereas RGN is an optimizer for stochastic Rayleigh quotient minimization in VMC.
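For contrast with RGN, the sketch below shows classical Rayleigh quotient iteration for a real symmetric matrix. This is the textbook special case, not the generalized constrained-manifold formulation described above, and the function name, tolerance, and test matrix are illustrative choices.

```python
import numpy as np

def rayleigh_quotient_iteration(A, x0, tol=1e-10, max_iter=20):
    """Classical RQI: alternate a shifted inverse-iteration step with an
    update of the shift to the current Rayleigh quotient."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    mu = x @ A @ x
    for _ in range(max_iter):
        if np.linalg.norm(A @ x - mu * x) < tol:
            break
        try:
            y = np.linalg.solve(A - mu * np.eye(len(x)), x)
        except np.linalg.LinAlgError:
            break                      # shift coincides with an eigenvalue
        x = y / np.linalg.norm(y)
        mu = x @ A @ x
    return mu, x

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
mu, x = rayleigh_quotient_iteration(A, [1.0, 0.0, 0.0])
residual = np.linalg.norm(A @ x - mu * x)
```

Each step here solves a deterministic shifted linear system for a single eigenpair. RGN instead solves a regularized linear system in parameter space, with all quantities estimated from samples.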

Second, the q-Gauss-Newton method is a q-calculus analogue of classical Gauss-Newton for unconstrained nonlinear least-squares problems. It replaces ordinary derivatives by q-derivatives and the Jacobian by a q-Jacobian, but the source explicitly states that it does not present a dedicated “Rayleigh-Gauss-Newton” development or a Rayleigh quotient derivation (Protic et al., 2021).

Third, the Recursive Gauss-Newton Filter (RGNF) is a memory-efficient recursive estimator derived from the classical Gauss-Newton filter. It uses weighted least squares, Newton-style local linearization, recursive updates for the normal-equation matrix and right-hand side, and a Levenberg-Marquardt adaptation for robustness. Its domain is nonlinear state estimation and tracking, not Rayleigh quotient minimization (Nadjiasngar et al., 2011).

Fourth, Gauss-Newton-type methods for bilevel optimization formulate necessary optimality conditions as an overdetermined nonlinear system, solve the associated least-squares problem by normal equations, and introduce pseudo-inverse and smoothing variants. The source explicitly notes that it does not introduce a Rayleigh quotient-specific Gauss-Newton method (Fliege et al., 2020). Likewise, relaxed inexact proximal Gauss-Newton methods for inverse problems add proximal regularization and relaxed steps for convergence in nonsmooth composite objectives (Jauhiainen et al., 2020), while full-space interior-point Gauss-Newton methods for PDE-constrained optimization use Gauss-Newton KKT linearizations and spectral preconditioning to achieve scalability under bound constraints (Hartland et al., 2024).

This suggests that the distinctive feature of RGN, as presently named, is not merely the presence of Gauss-Newton curvature or a Rayleigh quotient somewhere in the formulation, but the specific identification of a first-derivative-based local Hessian model for VMC energy minimization and its coupling with stochastic sampling control (Webber et al., 2021).