GPU-Based LM Solvers for Scene Reconstruction
- GPU-based Levenberg–Marquardt solvers are optimization methods that use custom CUDA kernels and a matrix-free approach to efficiently solve large-scale nonlinear least-squares problems.
- They leverage a cache-based memory strategy and batch processing to compute Jacobian–vector products, yielding approximately 30% faster convergence compared to ADAM-based methods.
- Empirical results demonstrate that these solvers maintain high output quality in 3D scene reconstruction, making them vital for advanced computer vision and graphics applications.
GPU-based Levenberg–Marquardt solvers are high-performance optimization methods for large-scale nonlinear least-squares problems routinely encountered in computer vision and graphics pipelines, such as 3D Gaussian Splatting (3DGS). The 3DGS-LM approach replaces classical first-order optimizers (e.g., ADAM) with a matrix-free, GPU-accelerated implementation of Levenberg–Marquardt (LM), achieving significantly faster convergence for scene reconstruction while maintaining the same output quality (Höllein et al., 2024). This solver is characterized by a sum-of-squares objective, direct GPU implementation via custom CUDA kernels, a cache-based memory strategy for efficient Jacobian–vector computation, and tight integration with batch-based processing for scalability across large datasets.
1. Mathematical Framework and Problem Setup
The optimization target is a scene parameterized by a set of $N$ three-dimensional Gaussians. Each Gaussian is described by a 59-dimensional parameter vector: 11 parameters represent position, orientation, scaling, and opacity, and 48 parameters correspond to spherical-harmonic color coefficients. A dataset of registered RGB images (each of resolution $H \times W$) defines the observation space. For every pixel $p$, two per-pixel losses are measured post-rendering:
- $r_{\mathrm{L1}}(p) = \hat{C}(p) - C(p)$, comparing the rendered color $\hat{C}(p)$ to the ground-truth color $C(p)$
- $r_{\mathrm{SSIM}}(p) = \mathrm{D\text{-}SSIM}\big(\hat{P}(p), P(p)\big)$, comparing the rendered patch $\hat{P}(p)$ around $p$ to the ground-truth patch $P(p)$
The global energy is a sum of weighted squared residuals:

$$E(x) = \sum_{p} \Big[ (1-\lambda)\, r_{\mathrm{L1}}(p)^2 + \lambda\, r_{\mathrm{SSIM}}(p)^2 \Big],$$

where $x \in \mathbb{R}^{59N}$ stacks all Gaussian parameters and $\lambda$ balances the two loss terms ($\lambda = 0.2$ in standard 3DGS). The residual vector $r(x)$ is defined entrywise as either $\sqrt{1-\lambda}\, r_{\mathrm{L1}}(p)$ or $\sqrt{\lambda}\, r_{\mathrm{SSIM}}(p)$, so that $E(x) = \|r(x)\|_2^2$. The goal is $x^{*} = \arg\min_x E(x)$.
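As a quick sanity check on this formulation, the following sketch (with made-up per-pixel residual values and the standard 3DGS weight $\lambda = 0.2$) confirms that entrywise square-root weighting of the residuals reproduces the weighted energy as a plain sum of squares, which is the form LM requires:

```python
import math

# Toy illustration, not the paper's implementation: assemble the stacked
# residual vector whose squared norm equals the weighted energy E(x).
lam = 0.2                      # SSIM weight, as in standard 3DGS
r_l1 = [0.10, -0.05, 0.02]     # per-pixel color residuals (assumed values)
r_ssim = [0.30, 0.20, 0.10]    # per-pixel D-SSIM residuals (assumed values)

# Entrywise scaling by sqrt of the weights turns the weighted sum of
# squares into an ordinary sum of squares.
residuals = ([math.sqrt(1 - lam) * v for v in r_l1] +
             [math.sqrt(lam) * v for v in r_ssim])

energy = sum(v * v for v in residuals)
direct = (sum((1 - lam) * v * v for v in r_l1) +
          sum(lam * v * v for v in r_ssim))
assert abs(energy - direct) < 1e-12
```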
LM proceeds via the Gauss–Newton step with Tikhonov-style damping. The update equation for the $k$-th iteration is:

$$\big( J_k^{\top} J_k + \lambda^{\mathrm{reg}} I \big)\, \Delta_k = -J_k^{\top} r(x_k), \qquad x_{k+1} = x_k + \Delta_k,$$

where $J_k = \partial r / \partial x \,\big|_{x_k}$ is the Jacobian of $r$ with respect to $x$, and $\lambda^{\mathrm{reg}}$ is a regularization coefficient adapted per iteration via the standard gain-ratio ($\rho$) test.
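The damped update and the $\rho$-test can be illustrated on a toy problem. The sketch below (synthetic data, a two-parameter exponential fit, and a simple halve/double damping schedule, none of which are from the paper) runs complete LM iterations with an accept/reject gain-ratio test:

```python
import math

# Minimal LM sketch on a toy curve fit r_i = a*exp(b*t_i) - y_i.
# Illustration only; the real solver is matrix-free over millions of
# parameters and never forms J^T J explicitly.
ts = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 2.7, 3.7, 5.0]          # synthetic observations (assumed)

def residuals(a, b):
    return [a * math.exp(b * t) - y for t, y in zip(ts, ys)]

def jacobian(a, b):
    # Rows: d r_i / d(a, b)
    return [[math.exp(b * t), a * t * math.exp(b * t)] for t in ts]

def energy(a, b):
    return sum(r * r for r in residuals(a, b))

def lm_step(a, b, lam_reg):
    r = residuals(a, b)
    J = jacobian(a, b)
    # Normal equations (J^T J + lam*I) delta = -J^T r; 2x2, solved by Cramer.
    JTJ = [[sum(J[i][p] * J[i][q] for i in range(len(r))) for q in range(2)]
           for p in range(2)]
    g = [sum(J[i][p] * r[i] for i in range(len(r))) for p in range(2)]
    A00, A11, A01 = JTJ[0][0] + lam_reg, JTJ[1][1] + lam_reg, JTJ[0][1]
    det = A00 * A11 - A01 * A01
    da = (-g[0] * A11 + g[1] * A01) / det
    db = (-g[1] * A00 + g[0] * A01) / det
    # Gain ratio rho: actual vs. model-predicted decrease of E.
    pred = -(g[0] * da + g[1] * db) + lam_reg * (da * da + db * db)
    rho = (energy(a, b) - energy(a + da, b + db)) / pred
    if rho > 0:                        # accept step, relax damping
        return a + da, b + db, lam_reg * 0.5
    return a, b, lam_reg * 2.0         # reject step, tighten damping

a, b, lam_reg = 1.0, 0.5, 1e-3
for _ in range(20):
    a, b, lam_reg = lm_step(a, b, lam_reg)
```

The predicted decrease uses the identity $J^{\top}J\Delta = -g - \lambda^{\mathrm{reg}}\Delta$ to avoid an extra matrix product, mirroring how the real solver evaluates the $\rho$-test cheaply.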
2. GPU-based Solver Architecture
The implementation is fully matrix-free, leveraging GPU parallelism and explicit CUDA kernels rather than materializing the massive sparse Jacobian matrix. All matrix computations are performed implicitly through custom operations over a “gradient cache”:
- buildCache: For each pixel, forward rasterization is followed by a per-pixel backward pass that stores all necessary pixel-to-splat gradients $\partial r_p / \partial G_s$ as flat cache entries, together with the pixel and splat indices that address the gradient payload.
- diagJTJ: Sums squared gradients per parameter to accumulate the diagonal elements of $J^{\top} J$.
- applyJ: Applies $J$ to a vector, streaming over the cache in pixel (ray) order.
- applyJT: Applies $J^{\top}$ to a vector, streaming over the cache sorted by splat (Gaussian) id.
Cache construction ensures efficient computation and memory coalescing, using sort operations to optimize data layout before each step.
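The cache-streaming pattern behind these kernels can be sketched in plain Python. All values are invented, and the cache is a simple list of `(pixel, param, gradient)` triples standing in for the flat GPU cache entries:

```python
# Matrix-free applyJ / applyJT / diagJTJ over a flat gradient cache.
# Each entry is (pixel index, parameter index, d r_pixel / d param);
# the gradient values below are made up for illustration.
cache = [
    (0, 0, 0.5), (0, 1, -0.2),
    (1, 0, 0.1), (1, 2, 0.7),
    (2, 1, 0.3), (2, 2, -0.4),
]
n_pixels, n_params = 3, 3

def apply_J(v):
    # (J v)_p = sum over cache entries of pixel p: grad * v[param]
    out = [0.0] * n_pixels
    for pixel, param, grad in cache:          # stream in pixel order
        out[pixel] += grad * v[param]
    return out

def apply_JT(u):
    # (J^T u)_j = sum over cache entries of param j: grad * u[pixel].
    # Sorting by param mirrors the splat-ordered GPU streaming for
    # coalesced writes; it is not needed for correctness here.
    out = [0.0] * n_params
    for pixel, param, grad in sorted(cache, key=lambda e: e[1]):
        out[param] += grad * u[pixel]
    return out

def diag_JTJ():
    # diag(J^T J)_j = sum over cache entries of param j: grad^2
    out = [0.0] * n_params
    for _, param, grad in cache:
        out[param] += grad * grad
    return out
```

On the GPU, each loop body becomes a thread's work item and the accumulations become atomic adds into the output buffer.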
3. Batching, Partial Update Aggregation, and Memory Management
To scale to datasets where the aggregate cache size exceeds GPU memory, the images are divided into batches. For each batch $b$:
- Construct the gradient cache for the batch.
- Accumulate the batch-specific right-hand side $J_b^{\top} r_b$ and the diagonal preconditioner $\mathrm{diag}(J_b^{\top} J_b)$.
- Solve the regularized normal equations using Preconditioned Conjugate Gradient (PCG, see Section 4).
Partial update directions $\Delta_b$ are combined as a weighted mean across batches, with per-parameter weights $w_b$ reflecting each batch's constraint strength on each Gaussian:

$$\Delta = \frac{\sum_b w_b \odot \Delta_b}{\sum_b w_b},$$

with the division taken elementwise.
A line search is performed on a subset of images to choose a step size $\gamma$ for the update $x_{k+1} = x_k + \gamma\,\Delta$. The damping coefficient $\lambda^{\mathrm{reg}}$ is then adjusted by the gain-ratio ($\rho$) test.
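A minimal sketch of the weighted-mean combination, with invented per-batch update directions and weights standing in for the per-parameter constraint strengths:

```python
# Combine per-batch PCG solutions into one update via a per-parameter
# weighted mean. All numbers are illustrative, not from the paper.
batch_deltas = [[0.10, -0.30], [0.20, -0.10]]   # per-batch update directions
batch_weights = [[4.0, 1.0], [1.0, 3.0]]        # per-batch constraint weights

n_params = 2
delta = []
for j in range(n_params):
    num = sum(w[j] * d[j] for d, w in zip(batch_deltas, batch_weights))
    den = sum(w[j] for w in batch_weights)
    delta.append(num / den if den > 0 else 0.0)
# A parameter strongly constrained in one batch pulls the mean toward
# that batch's direction; weakly constrained batches contribute little.
```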
4. Preconditioned Conjugate Gradient (PCG) in CUDA
Each LM update solves the regularized normal system with PCG, requiring only implicit matrix–vector multiplies. With the Jacobi preconditioner $M = \mathrm{diag}(J^{\top} J) + \lambda^{\mathrm{reg}} I$, the workflow is as follows:
- Initialize the residual $r_0 = -J^{\top} r(x)$ (for $\Delta_0 = 0$), the preconditioned vector $z_0 = M^{-1} r_0$, and the search direction $p_0 = z_0$.
- For iterations $i = 0, 1, \dots$, compute:
- $u_i = J p_i$ via applyJ
- $q_i = J^{\top} u_i + \lambda^{\mathrm{reg}} p_i$ via applyJT plus the damping term
- $\alpha_i = (r_i^{\top} z_i) / (p_i^{\top} q_i)$, and updates to $\Delta_{i+1}$, $r_{i+1}$, $z_{i+1}$ (diagonal scaling by $M^{-1}$), and $p_{i+1}$ accordingly
All these steps are implemented as specialized GPU kernels, using the prebuilt gradient cache.
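The loop can be mirrored in a few lines of Python. This sketch defines the applyJ/applyJT callbacks from a small dense Jacobian purely for checking; the actual solver never materializes $J$:

```python
# Jacobi-preconditioned CG for (J^T J + lam*I) delta = -J^T r, using only
# applyJ / applyJT callbacks. Toy 3x2 system; values are illustrative.
J = [[2.0, 0.0], [1.0, 3.0], [0.0, 1.0]]
r = [1.0, -2.0, 0.5]
lam = 0.1

def apply_J(v):
    return [sum(Ji[k] * v[k] for k in range(2)) for Ji in J]

def apply_JT(u):
    return [sum(J[i][k] * u[i] for i in range(3)) for k in range(2)]

def pcg(iters=25):
    diag = [sum(J[i][k] ** 2 for i in range(3)) + lam for k in range(2)]
    b = [-g for g in apply_JT(r)]                 # right-hand side -J^T r
    delta = [0.0, 0.0]
    res = b[:]                                    # CG residual (delta0 = 0)
    z = [res[k] / diag[k] for k in range(2)]      # Jacobi preconditioning
    p = z[:]
    rz = sum(res[k] * z[k] for k in range(2))
    for _ in range(iters):
        q = apply_JT(apply_J(p))                  # J^T (J p)
        q = [q[k] + lam * p[k] for k in range(2)] # + lam * p  (damping)
        alpha = rz / sum(p[k] * q[k] for k in range(2))
        delta = [delta[k] + alpha * p[k] for k in range(2)]
        res = [res[k] - alpha * q[k] for k in range(2)]
        z = [res[k] / diag[k] for k in range(2)]
        rz_new = sum(res[k] * z[k] for k in range(2))
        if rz_new < 1e-30:                        # converged
            break
        p = [z[k] + (rz_new / rz) * p[k] for k in range(2)]
        rz = rz_new
    return delta

delta = pcg()
```

Each of `apply_J`, `apply_JT`, and the diagonal scaling corresponds to one kernel launch over the gradient cache in the GPU implementation.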
5. Computational Complexity and Efficiency
The dominant cost per LM iteration is in the PCG loop:
- Rasterization + cache build: $O(|B| \cdot H W \cdot \bar{K})$ for a batch $B$ of $H \times W$ images, with $\bar{K}$ average splats per pixel.
- PCG matvecs: each iteration streams once over the cache, costing $O(|B| \cdot H W \cdot \bar{K})$, with three such passes per batch.
- Batch combination: $O(P)$ to aggregate updates over all $P = 59N$ parameters.
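A back-of-envelope calculation (all numbers assumed for illustration, not taken from the paper) shows why the cache must be built per batch rather than for the whole dataset:

```python
# Rough cache-size estimate under assumed values: images per batch,
# resolution, mean splats per pixel, and bytes per cache entry.
images_per_batch = 25          # assumed
height, width = 1080, 1920     # assumed resolution
mean_splats_per_pixel = 32     # assumed K-bar
bytes_per_entry = 64           # assumed gradient payload + indices

entries = images_per_batch * height * width * mean_splats_per_pixel
gib = entries * bytes_per_entry / 2**30
# Tens of GiB per batch: this motivates both batching and the compact,
# sorted cache layout described above.
```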
In practice, only a small number of PCG iterations per batch and a few LM iterations beyond an initial ADAM warmup are required, whereas ADAM alone needs tens of thousands of iterations. This yields approximately 30% faster end-to-end optimization over the original ADAM-only 3DGS pipeline (Höllein et al., 2024).
6. Empirical Benchmarks and Output Quality
On an NVIDIA A100 GPU (Tanks & Temples “Train” scene):
- 3DGS + ADAM: 736 s, SSIM 0.844, PSNR 23.68, LPIPS 0.178
- 3DGS + LM: 663 s (about 10% faster), matching SSIM, PSNR, and LPIPS
Combining LM with additional system-level speedups (DISTWAR, gsplat, Taming-3DGS) maintains a 30% total speed increase relative to vanilla 3DGS. Output quality remains unchanged according to standard perceptual quality metrics.
7. Key Equations and GPU Code Constructs
The core update equation is the regularized normal equation:

$$\big( J^{\top} J + \lambda^{\mathrm{reg}} I \big)\, \Delta = -J^{\top} r.$$
PCG matrix–vector steps are realized as:
- applyJ: for each Gaussian $g$, stream its cache entries (located via gaussianParamIndex) and accumulate $(J v)_p \mathrel{+}= (\partial r_p / \partial x_g)\, v_g$
- applyJT: for each ray (pixel) $p$, stream its cache entries (located via pixelIndex) and accumulate $(J^{\top} u)_g \mathrel{+}= (\partial r_p / \partial x_g)\, u_p$
- diagJTJ: for each parameter $j$, accumulate $\mathrm{diag}(J^{\top} J)_j = \sum_p (\partial r_p / \partial x_j)^2$
CUDA pseudocode for per-pixel cache build:
```
// buildCache kernel (one thread per pixel):
for each pixel thread t:
    trace splats s1 ... sK along the ray of pixel t
    compute the alpha-blend state up to splat sk
    compute dr/dsk, the residual gradient w.r.t. splat sk
    write {pixel = t, splat = sk, grad = dr/dsk} to the cache
// sortCache: Thrust/radix sort of entries by splat -> Gaussian id
```
An efficient implementation follows this workflow for matrix-free LM updates, exploiting full GPU parallelism through explicit kernel launches and gradient caching.
For further details on implementation specifics and the complete pipeline, see (Höllein et al., 2024).