High-Precision Gauss–Newton Optimizer

Updated 28 April 2026

The paper presents a high-precision Gauss–Newton optimizer that achieves near machine accuracy by combining advanced linear algebra, damping, and mixed-precision arithmetic.
It reduces computational cost in large-scale settings by exploiting low-rank factorizations and block-sparse structures, making it suitable for deep neural networks and physics-informed models.
The optimizer offers robust convergence guarantees with local quadratic rates and has been successfully adapted for various high-precision applications in scientific computing.

A high-precision Gauss–Newton optimizer is an algorithmic framework that leverages the Gauss–Newton (GN) or generalized Gauss–Newton (GGN) approximation to the Hessian matrix for the efficient, accurate solution of nonlinear least-squares and related problems. High-precision variants employ numerically stable solvers, damping or regularization, and often exploit advanced linear-algebraic identities or mixed-precision arithmetic to compute search directions to near machine accuracy at moderate computational cost. Such methods are critical for applications where first-order methods either stall at suboptimal accuracy or suffer prohibitively slow convergence in ill-conditioned or overparameterized settings. This article provides a detailed synthesis of modern advances, covering theoretical convergence, efficient algorithms for large-scale machine learning and scientific computing, and domain-specific adaptations.

1. Mathematical Structure of the High-Precision Gauss–Newton Method

Let $L(\theta) = \frac{1}{2} \|r(\theta)\|^2$ denote the standard nonlinear least-squares objective, where $\theta \in \mathbb{R}^d$ are the parameters and $r: \mathbb{R}^d \to \mathbb{R}^m$ is the residual vector. The gradient and exact Hessian are given by

$g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$

where $J(\theta)$ is the Jacobian of $r$ and $S(\theta) = \sum_{i=1}^{m} r_i(\theta)\, \nabla^2_\theta r_i(\theta)$ . The GN approximation neglects the second term, yielding $H^{GN}(\theta) \approx J(\theta)^T J(\theta)$ . For regularized or generalized settings, e.g., for deep neural networks (DNNs) or PDE-constrained models, a block-diagonal or structure-aware $Q$ replaces the identity: $H^{GN} = J^T Q J$ (Korbit et al., 2024).

The high-precision update solves the regularized linear system

$\theta \in \mathbb{R}^d$ 0

with $\theta \in \mathbb{R}^d$ 1 (Levenberg–Marquardt damping) to improve conditioning. Step-size selection is typically handled via an Armijo or trust-region line search; variance reduction and adaptive regularization are used in stochastic regimes (Korbit et al., 2024).

2. Algorithmic Implementations: Linear Algebra and Low-Rank Factorizations

Naïvely, solving the $\theta \in \mathbb{R}^d$ 2 linear system is cubic in $\theta \in \mathbb{R}^d$ 3 and quickly becomes intractable for modern large-scale settings. High-precision Gauss–Newton algorithms leverage low-rank and block structures to reduce computational cost:

In supervised DNN training, the Duncan–Guttman identity rewrites the update as

$\theta \in \mathbb{R}^d$ 4

where $\theta \in \mathbb{R}^d$ 5 for mini-batch size $\theta \in \mathbb{R}^d$ 6 and output dimension $\theta \in \mathbb{R}^d$ 7 (Korbit et al., 2024). This reduces the main solve to a $\theta \in \mathbb{R}^d$ 8 system, with total per-step cost $\theta \in \mathbb{R}^d$ 9—vastly more efficient than $r: \mathbb{R}^d \to \mathbb{R}^m$ 0 when $r: \mathbb{R}^d \to \mathbb{R}^m$ 1.

In tensor decomposition, Hessian–vector products in the CP decomposition case are performed implicitly via $r: \mathbb{R}^d \to \mathbb{R}^m$ 2 block tensor contractions, enabling parallelization and CG-based solvers for high-accuracy solves (Singh et al., 2019).
For domain decomposition architectures in physics-informed neural networks (PINNs), Gauss–Newton steps are realized as block-sparse linear systems due to locality of parameter influence, supporting both direct block-factorizations and efficient iterative schemes that exploit sparsity (Heinlein et al., 30 Oct 2025).

Compact high-level pseudocode for the GN update with momentum, line search, and adaptive damping is as follows (Korbit et al., 2024): $r$ 6 Conjugate gradient solvers with tight tolerances are frequently used to achieve high-precision directions when even the small-system inversions are too costly to perform directly (Gargiani et al., 2020, Singh et al., 2019).

3. Convergence Guarantees and Limiting Behavior

High-precision Gauss–Newton optimizers typically admit strong local convergence guarantees and, under standard conditions, global convergence to stationary points up to the noise floor or finite-precision limit.

For DNNs and general stochastic objectives, under mild assumptions (bounded variance, Lipschitz gradients/hessian blocks, and appropriate batch size), the EGN algorithm achieves

$r: \mathbb{R}^d \to \mathbb{R}^m$ 3

implying $r: \mathbb{R}^d \to \mathbb{R}^m$ 4 iteration complexity for reaching an $r: \mathbb{R}^d \to \mathbb{R}^m$ 5-stationary point (Korbit et al., 2024).

For nonconvex low-rank matrix factorization, global convergence to a stationary point is established under full-rank and boundedness conditions (on the singular values of $r: \mathbb{R}^d \to \mathbb{R}^m$ 6), with local quadratic convergence if the residual vanishes or local linear if only "small" (Tran-Dinh, 2016).
Mixed-precision variants (GN $r: \mathbb{R}^d \to \mathbb{R}^m$ 7) guarantee local linear or superlinear convergence down to a limiting accuracy dictated by the working precision and solver backward error, often expressed as

$r: \mathbb{R}^d \to \mathbb{R}^m$ 8

where $r: \mathbb{R}^d \to \mathbb{R}^m$ 9 are the unit roundoffs for gradient and Hessian computations, and $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 0 summarizes additional floating-point error sources (Carrino et al., 20 Nov 2025).

4. Numerical Strategies for High Precision

Achieving high-precision convergence requires careful attention to numerical stability and regularization:

Levenberg–Marquardt damping $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 1 is ubiquitous to maintain positive definiteness and counter near-rank-deficiency, with adaptive schemes (increased or decreased based on progress) for further efficiency (Korbit et al., 2024, Tran-Dinh, 2016, Carrino et al., 20 Nov 2025).
Line search (Armijo, Wolfe, or trust-region) robustifies step selection, especially for nonconvex problems or when the local quadratic model is unreliable (Brooks, 2023, Tran-Dinh, 2016).
In stochastic or large-scale settings, Monte Carlo evaluation of matrix blocks, matrix-free CG solves, and exploitation of block-sparse or low-rank structures are crucial for both accuracy and scalability (Jnini et al., 2024, Heinlein et al., 30 Oct 2025, Singh et al., 2019).
Mixed-precision implementations prioritize highest available precision for gradient/residual evaluation, medium for assembling and updating the Hessian, and lowest viable for linear solves (subject to solver backward-error limitations), with stopping criteria based on both residual norm and backward error (Carrino et al., 20 Nov 2025).

5. Domain-Specific Adaptations and Examples

High-precision GN schemes have been specialized across domains:

Deep Neural Networks (DNNs): Exact GN methods (EGN) outperform or match SGD, Adam, and SGN on regression and classification tasks. For instance, on California Housing, Superconductivity, and Diamonds benchmarks, EGN attains lower test RMSE than Adam and SGD with similar or improved wall-clock costs (Korbit et al., 2024).
Physics-Informed Neural Networks and PDEs: Block-sparse GN, GN natural gradient, and function-space GN methods for PINN frameworks achieve relative $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 2 errors of $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 3– $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 4 (single-precision floor), with efficient matrix-free AD + CG implementation (Jnini et al., 2024, Heinlein et al., 30 Oct 2025).
Low-Rank Matrix Factorization and Completion: GN and ADMM-GN schemes solve large nonconvex matrix problems to accuracies of $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 5– $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 6 in $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 7– $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 8 iterations, far exceeding alternating minimization, LMaFit, or Frank–Wolfe in final residual and convergence speed (Tran-Dinh, 2016).
Tensor Decomposition: GN with implicit Hessian computation and preconditioned CG achieves machine-precision relative error in a fraction of the time and iterations required by ALS, with better scalability and parallel efficiency (Singh et al., 2019).
Shape Learning and Geometric Optimization: GN natural-gradient methods address ill-conditioned geometric constraints and enhance convergence to high-precision solutions relative to first-order or quasi-Newton baselines (King et al., 24 Jan 2026).
Variational Monte Carlo: Rayleigh–Gauss–Newton steps achieve superlinear convergence in noiseless problems, and with enhanced sampling (parallel tempering) yield ground-state energy accuracies of $g(\theta) = J(\theta)^{T} r(\theta), \quad H(\theta) = J(\theta)^T J(\theta) + S(\theta),$ 9– $J(\theta)$ 0 after only 200 updates, considerably surpassing natural gradient (Webber et al., 2021).

6. Extensions, Higher-Order Corrections, and Limitations

Several extensions and refinements have been developed:

Higher-Order Corrections: By systematically including geodesic acceleration and higher derivatives, convergence can be further accelerated in curved valleys; these corrections are particularly valuable in ill-conditioned settings, with significant reductions in iteration counts even for high anisotropy (Brooks, 2023).
q-Gauss–Newton Methods: Incorporating $J(\theta)$ 1-derivatives interpolates between first-order and classical GN, providing flexible tradeoffs between global attraction and local quadratic convergence; $J(\theta)$ 2 yields standard GN (Protic et al., 2021).
Partial Second-Order (GN $J(\theta)$ 3): Selectively including only the largest-residual second-order terms provides near-Newton convergence at reduced cost, with careful precision management (Carrino et al., 20 Nov 2025).
Motion Optimization: Convergence of the true Hessian to the GN approximation can be rigorously quantified as $J(\theta)$ 4 for discrete-time trajectory optimization, ensuring that for sufficiently fine discretizations, the GN optimizer captures true curvature up to the tolerance limit (Ratliff et al., 2016).

Principal limitations include the need for a sufficiently close initialization (as GN remains a local method), matrix inversion costs for extremely large mini-batches or parameter spaces (requiring iterative solvers), and subtle numerical issues when residuals or Jacobian singular values are very small.

7. Comparative Summary and Practical Guidelines

The following table summarizes key comparisons for high-precision GN-type optimizers in different domains, capturing the main findings:

Domain	High-Precision GN Method	Accuracy Attained	Main Comparative Baseline
DNN optimization	EGN, SGN (+CG, damping)	$J(\theta)$ 5– $J(\theta)$ 6	SGD/Adam (slower to converge)
CP tensor decomposition	GN (implicit Hessian + CG)	$J(\theta)$ 7	ALS (stalls at $J(\theta)$ 8)
PINN for Navier–Stokes	Matrix-free GN Natural Gradient	$J(\theta)$ 9	Adam/BFGS ( $r$ 0– $r$ 1)
Low-rank matrix factor	Extended GN / ADMM-GN	$r$ 2– $r$ 3	Alt-min (saturates at $r$ 4)
Variational Monte Carlo	Rayleigh–Gauss–Newton	$r$ 5	Natural grad (few digits)

Empirically, high-precision GN methods consistently:

Attain lower residual norms or higher test set metrics in fewer wall-clock time or iterations relative to first-order methods.
Exhibit greater robustness to hyperparameter choices, owing to stable quadratic or quasi-quadratic convergence and built-in regularization.
Deliver solutions closer to the limit of floating-point precision when solver tolerances and damping are carefully chosen.

For best results, practitioners should:

Use the highest affordable precision for residual and gradient computations.
Exploit problem structure (block-sparsity, low-rank, local coupling) for scalability.
Employ adaptive damping/regularization and line search.
Monitor solver backward errors, convergence rate, and limit accuracy in mixed/multiple precision settings.

High-precision Gauss–Newton optimizers form a flexible, theoretically grounded, and practically validated foundation for next-generation optimization in large-scale, ill-conditioned, and accuracy-critical problems across scientific and machine learning domains (Korbit et al., 2024, Singh et al., 2019, Tran-Dinh, 2016, Jnini et al., 2024, Ratliff et al., 2016, Carrino et al., 20 Nov 2025, Brooks, 2023, Heinlein et al., 30 Oct 2025, King et al., 24 Jan 2026, Webber et al., 2021, Gargiani et al., 2020, Protic et al., 2021).