
GPU-Based LM Solvers for Scene Reconstruction

Updated 24 December 2025
  • GPU-based Levenberg–Marquardt solvers are optimization methods that use custom CUDA kernels and a matrix-free approach to efficiently solve large-scale nonlinear least-squares problems.
  • They leverage a cache-based memory strategy and batch processing to compute Jacobian–vector products, yielding approximately 30% faster convergence compared to ADAM-based methods.
  • Empirical results demonstrate that these solvers maintain high output quality in 3D scene reconstruction, making them vital for advanced computer vision and graphics applications.

GPU-based Levenberg–Marquardt solvers are high-performance optimization methods for large-scale nonlinear least-squares problems routinely encountered in computer vision and graphics pipelines, such as 3D Gaussian Splatting (3DGS). The 3DGS-LM approach replaces classical first-order optimizers (e.g., ADAM) with a matrix-free, GPU-accelerated implementation of Levenberg–Marquardt (LM), achieving significantly faster convergence for 3DGS scene reconstruction while maintaining the same output quality (Höllein et al., 2024). This solver is characterized by a sum-of-squares objective, direct GPU implementation via custom CUDA kernels, a cache-based memory strategy for efficient Jacobian–vector computation, and tight integration with batch-based processing for scalability across large datasets.

1. Mathematical Framework and Problem Setup

The optimization target is a scene parameterized by $M$ three-dimensional Gaussians. Each Gaussian is described by a 59-dimensional vector: 11 parameters represent position, orientation, scaling, and opacity, and 48 parameters correspond to spherical-harmonic color coefficients. A dataset of $P$ registered RGB images (each $H \times W$) defines the observation space. For every pixel $i$, two per-pixel losses are measured post-rendering:

  • $\ell_{1i} = L_1(\text{rendered color}, \text{ground-truth color})$
  • $\ell_{2i} = 1 - \mathrm{SSIM}(\text{rendered patch}, \text{ground-truth patch})$

The global energy is a sum of weighted squared residuals:

$$E(x) = \sum_{i=1}^{N} \left[ \left(\sqrt{\lambda_1 \ell_{1i}}\right)^2 + \left(\sqrt{\lambda_2 \ell_{2i}}\right)^2 \right]$$

where $N = 3 \cdot H \cdot W \cdot P$, $\lambda_1 = 0.2$, and $\lambda_2 = 0.8$. The residual vector $F(x) \in \mathbb{R}^N$ is defined entrywise as either $\sqrt{\lambda_1 \ell_{1i}}$ or $\sqrt{\lambda_2 \ell_{2i}}$. The goal is $\min_x \frac{1}{2}\|F(x)\|_2^2$.
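As a concrete check of this objective, the weighted residual stacking can be sketched in NumPy. This is a toy stand-in, not the paper's renderer: `l1`, `l2`, and `residual_vector` are our names, and the loss values are made up; in 3DGS-LM they come from differentiable rasterization.

```python
import numpy as np

lam1, lam2 = 0.2, 0.8  # loss weights from the energy definition

def residual_vector(l1, l2):
    """Stack sqrt(lam1*l1_i) and sqrt(lam2*l2_i) into one residual vector F."""
    return np.concatenate([np.sqrt(lam1 * l1), np.sqrt(lam2 * l2)])

l1 = np.array([0.10, 0.05, 0.20])   # per-pixel L1 losses (synthetic)
l2 = np.array([0.30, 0.15, 0.25])   # per-pixel DSSIM losses (synthetic)

F = residual_vector(l1, l2)
E = np.sum(F**2)   # equals lam1*sum(l1) + lam2*sum(l2), the energy E(x)
```

Because each residual is the square root of a weighted loss, summing the squared entries of $F$ recovers the weighted sum of per-pixel losses exactly.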

LM proceeds via the Gauss–Newton step with Tikhonov-style damping. The update equation for the $k$-th iteration is:

$$\left(J^{\top}J + \lambda_{\mathrm{reg}} \cdot \mathrm{diag}(J^{\top}J)\right) \Delta x = -J^{\top} F(x_k)$$

where $J$ is the Jacobian of $F$ with respect to $x$, and $\lambda_{\mathrm{reg}} > 0$ is a regularization coefficient adapted per iteration via the standard $\rho$-test.
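The damped update can be illustrated on a tiny dense problem. This is a minimal sketch under our own toy residual $F(x) = (x_0^2 - 1,\; x_0 x_1 - 2,\; x_1 - 2)$ with a hand-derived Jacobian; in 3DGS-LM both $J$ and $F$ come from the rasterizer and are never materialized.

```python
import numpy as np

def F(x):
    # Toy residual with a zero-residual solution at (1, 2).
    return np.array([x[0]**2 - 1.0, x[0]*x[1] - 2.0, x[1] - 2.0])

def J(x):
    # Hand-derived Jacobian of F.
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    1.0]])

def lm_step(x, lam_reg):
    Jx, Fx = J(x), F(x)
    JTJ = Jx.T @ Jx
    # Marquardt-style damping: scale the diagonal of J^T J.
    A = JTJ + lam_reg * np.diag(np.diag(JTJ))
    return np.linalg.solve(A, -Jx.T @ Fx)

x = np.array([0.5, 1.0])
for _ in range(20):
    x = x + lm_step(x, lam_reg=1e-2)
# x approaches the zero-residual point near (1, 2)
```

Here the damping is held fixed for simplicity; the solver described above instead adapts $\lambda_{\mathrm{reg}}$ each iteration via the $\rho$-test.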

2. GPU-based Solver Architecture

The implementation is fully matrix-free, leveraging GPU parallelism and explicit CUDA kernels rather than materializing the massive sparse Jacobian matrix. All matrix computations are performed implicitly through custom operations over a “gradient cache”:

  • buildCache: For each pixel, forward rasterization is followed by a per-pixel backward pass that stores all necessary pixel-to-splat gradients ($\partial\,\text{pixel}/\partial\,\text{splat}$) as flat cache entries, each holding a $(\text{pixel index}, \text{splat index})$ pair and the gradient payload.
  • diagJTJ: Sums squared cached gradients per parameter to accumulate the diagonal elements of $J^{\top}J$.
  • applyJ: Applies $J$ to a vector, streaming over the cache sorted by pixel (one output entry per residual).
  • applyJT: Applies $J^{\top}$, streaming over the cache sorted by splat (one output entry per Gaussian parameter).

Cache construction ensures efficient computation and memory coalescing, using sort operations to optimize data layout before each step.
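The cache-based matvecs can be sketched with NumPy scatter-adds standing in for the CUDA kernels. Each cache entry is a (pixel, parameter, gradient) triple, i.e. one nonzero of $J$; `applyJ`, `applyJT`, and `diagJTJ` reduce over these entries without ever forming the matrix. All names and sizes below are ours, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_par, nnz = 6, 4, 12
# Draw nnz distinct (pixel, param) pairs so each Jacobian entry is cached once.
idx  = rng.choice(n_pix * n_par, size=nnz, replace=False)
pix  = idx // n_par                   # pixel (residual) index per cache entry
par  = idx %  n_par                   # Gaussian-parameter index per cache entry
grad = rng.standard_normal(nnz)       # cached d(pixel)/d(param) value

def applyJ(p):                        # u = J p   (one output per pixel)
    u = np.zeros(n_pix)
    np.add.at(u, pix, grad * p[par])
    return u

def applyJT(u):                       # g = J^T u (one output per parameter)
    g = np.zeros(n_par)
    np.add.at(g, par, grad * u[pix])
    return g

def diagJTJ():                        # diagonal of J^T J
    d = np.zeros(n_par)
    np.add.at(d, par, grad**2)
    return d

# Cross-check against the explicitly assembled Jacobian.
Jdense = np.zeros((n_pix, n_par))
Jdense[pix, par] = grad
p = rng.standard_normal(n_par)
assert np.allclose(applyJ(p), Jdense @ p)
assert np.allclose(applyJT(applyJ(p)), Jdense.T @ (Jdense @ p))
assert np.allclose(diagJTJ(), np.diag(Jdense.T @ Jdense))
```

In the actual solver the same reductions run as coalesced CUDA kernels over sorted cache segments rather than `np.add.at` scatters.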

3. Batching, Partial Update Aggregation, and Memory Management

To scale to datasets where the aggregate cache size exceeds GPU memory, the images are divided into $n_b$ batches. For each batch $i$:

  • Construct the gradient cache for the batch.
  • Accumulate the batch-specific gradient $J_i^{\top}F_i$ and the diagonal preconditioner $1 / \mathrm{diag}(J_i^{\top}J_i)$.
  • Solve the regularized normal equations using Preconditioned Conjugate Gradient (PCG, see Section 4).

Partial update directions $\Delta_i$ are combined as a weighted mean across batches, weighted by $\mathrm{diag}(J_i^{\top}J_i)$, reflecting each batch's constraint strength on each Gaussian:

$$\Delta = \frac{\sum_{i=1}^{n_b} \mathrm{diag}(J_i^{\top}J_i) \cdot \Delta_i}{\sum_{i=1}^{n_b} \mathrm{diag}(J_i^{\top}J_i)}$$

A line search is performed on a subset of images to choose the step size $\gamma$ for the update $x_{k+1} = x_k + \gamma \Delta$. The damping coefficient $\lambda_{\mathrm{reg}}$ is then adjusted by the $\rho$-test.
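The aggregation rule is an elementwise weighted mean, which a short sketch makes concrete. The data below is synthetic, and `aggregate` and its `eps` guard (for parameters unconstrained by any batch) are our additions.

```python
import numpy as np

def aggregate(deltas, diag_weights, eps=1e-12):
    """Elementwise weighted mean of per-batch updates.

    deltas:       list of per-batch update directions Delta_i
    diag_weights: list of per-batch diag(J_i^T J_i) vectors
    eps:          guards division by zero for unconstrained parameters
    """
    num = sum(w * d for w, d in zip(diag_weights, deltas))
    den = sum(diag_weights) + eps
    return num / den

d1 = np.array([1.0, 0.0, 2.0])   # update direction from batch 1 (synthetic)
d2 = np.array([3.0, 4.0, 2.0])   # update direction from batch 2 (synthetic)
w1 = np.array([1.0, 1.0, 1.0])   # diag(J_1^T J_1)
w2 = np.array([1.0, 3.0, 1.0])   # diag(J_2^T J_2)

delta = aggregate([d1, d2], [w1, w2])
# Middle entry is pulled toward d2, which constrains that parameter 3x harder.
```

Parameters that a batch barely observes contribute little to its entry of the mean, so well-constrained batches dominate the combined direction.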

4. Preconditioned Conjugate Gradient (PCG) in CUDA

Each LM update solves the regularized normal system with PCG, requiring only implicit matrix–vector multiplies. The workflow is as follows:

  • Initialize the residual $r_0 = b - (J^{\top}J + \lambda D)x_0$, the preconditioned vector $z_0 = M^{-1}r_0$, and the search direction $p_0 = z_0$.
  • For $T$ iterations, compute:
    • $u_k = Jp_k$ via applyJ
    • $g_k = J^{\top}u_k + \lambda D p_k$ via applyJT and diagonal scaling
    • $\alpha_k$, $\beta_k$, and the updates to $x$, $r$, $z$, $p$ accordingly

All these steps are implemented as specialized GPU kernels, using the prebuilt gradient cache.
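The PCG loop itself is standard; below is a minimal NumPy sketch with explicit `J.T @ (J @ p)` matvecs standing in for the applyJ/applyJT kernels, and a Jacobi preconditioner built from the damping diagonal. The small dense $J$ exists only so the result can be verified against a direct solve.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((20, 5))        # dense stand-in for the implicit Jacobian
b = rng.standard_normal(5)              # right-hand side, i.e. -J^T F
lam = 0.1                               # damping coefficient lambda
D = np.diag(J.T @ J)                    # damping diagonal (what diagJTJ computes)
M_inv = 1.0 / (D + lam * D)             # Jacobi preconditioner: inverse of diag(A)

def matvec(p):
    # (J^T J + lam*D) p via two matvecs -- never forms J^T J.
    return J.T @ (J @ p) + lam * D * p

def pcg(b, T=25, tol=1e-10):
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = M_inv * r
    p = z.copy()
    for _ in range(T):
        u = matvec(p)
        alpha = (r @ z) / (p @ u)
        x += alpha * p
        r_new = r - alpha * u
        if np.linalg.norm(r_new) < tol:
            break
        z_new = M_inv * r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

x = pcg(b)
A = J.T @ J + lam * np.diag(D)
assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
```

In the GPU solver each of these vector operations maps to a kernel launch over the prebuilt gradient cache, with the two matvecs dominating the cost.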

5. Computational Complexity and Efficiency

The dominant cost per LM iteration is in the PCG loop:

  • Rasterization + cache build: $O(B \cdot HW \cdot K)$ for a batch of $B$ images, with $K$ the average number of splats per pixel.
  • PCG matvecs: Each iteration costs $O(\mathrm{nnz}(J)) \approx B \cdot HW \cdot K$, with three such passes per batch.
  • Batch combination: $O(n_b M)$ to aggregate updates.

In practice: $B = 25$–$70$, $n_b = 3$–$4$, and $T = 8$ PCG iterations; LM requires $\sim 5$ iterations beyond an initial ADAM warmup, whereas ADAM alone requires $\sim 10^4$ iterations. This yields approximately $30\%$ faster end-to-end optimization over the original ADAM-only 3DGS pipeline (Höllein et al., 2024).

6. Empirical Benchmarks and Output Quality

On NVIDIA A100 (Tanks & Temples “Train” scene):

  • 3DGS + ADAM: $\sim$736 s, SSIM 0.844, PSNR 23.68, LPIPS 0.178
  • 3DGS + LM: $\sim$663 s ($-10\%$), identical SSIM, PSNR, LPIPS

Combining LM with additional system-level speedups (DISTWAR, gsplat, Taming-3DGS) maintains a $\sim 30\%$ total speed increase relative to vanilla 3DGS. Output quality remains unchanged according to standard perceptual quality metrics.

7. Key Equations and GPU Code Constructs

The core update equation is the regularized normal equation:

$$\left(J^{\top}J + \lambda_{\mathrm{reg}} \cdot \mathrm{diag}(J^{\top}J)\right) \Delta = -J^{\top}F$$

PCG matrix-vector steps are realized as:

  • applyJ: For each pixel (residual) $i$, accumulate $u[i] = \sum_e \mathrm{grad}_e \cdot p[\text{gaussianParamIndex}(e)]$ over the cache entries $e$ belonging to pixel $i$.
  • applyJT: For each Gaussian parameter $j$, accumulate $g[j] = \sum_e \mathrm{grad}_e \cdot u[\text{pixelIndex}(e)]$ over the cache entries $e$ touching parameter $j$.
  • diagJTJ: For each parameter $j$, $M[j] = \sum_e (\mathrm{grad}_e)^2$.

CUDA pseudocode for per-pixel cache build:

// buildCache kernel (one thread per pixel):
for each pixel thread t:
    trace splats s1...sK along the ray
    for k = 1...K:
        compute alpha-blend state up to s_k
        compute dr/ds_k                      // pixel-to-splat gradient
        write {pixel = t, splat = s_k, grad = dr/ds_k} to cache
// sortCache: Thrust/radix sort of cache entries by splat -> gaussian id

An efficient implementation follows this workflow for matrix-free LM updates, exploiting full GPU parallelism through explicit kernel launches and gradient caching.


For further details on implementation specifics and the complete pipeline, see (Höllein et al., 2024).
