GPU-Based LM Solvers for Scene Reconstruction
- GPU-based Levenberg–Marquardt solvers are optimization methods that use custom CUDA kernels and a matrix-free approach to efficiently solve large-scale nonlinear least-squares problems.
- They leverage a cache-based memory strategy and batch processing to compute Jacobian–vector products, yielding approximately 30% faster convergence compared to ADAM-based methods.
- Empirical results demonstrate that these solvers maintain high output quality in 3D scene reconstruction, making them vital for advanced computer vision and graphics applications.
GPU-based Levenberg–Marquardt solvers are high-performance optimization methods for large-scale nonlinear least-squares problems routinely encountered in computer vision and graphics pipelines, such as 3D Gaussian Splatting (3DGS). The 3DGS-LM approach replaces classical first-order optimizers (e.g., ADAM) with a matrix-free, GPU-accelerated implementation of Levenberg–Marquardt (LM), achieving significantly faster convergence for scene reconstruction while maintaining the same output quality (Höllein et al., 2024). This solver is characterized by a sum-of-squares objective, direct GPU implementation via custom CUDA kernels, a cache-based memory strategy for efficient Jacobian–vector computation, and tight integration with batch-based processing for scalability across large datasets.
1. Mathematical Framework and Problem Setup
The optimization target is a scene parameterized by a set of $N$ three-dimensional Gaussians. Each Gaussian is described by a 59-dimensional parameter vector: 11 parameters represent position, orientation, scaling, and opacity, and 48 parameters correspond to spherical-harmonic color coefficients. A dataset of registered RGB images (each of resolution $H \times W$) defines the observation space. For every pixel $p$, two per-pixel losses are measured post-rendering:
- $r_{\mathrm{L1}}(p) = \hat{C}(p) - C(p)$, comparing the rendered color $\hat{C}(p)$ to the ground-truth color $C(p)$
- $r_{\mathrm{SSIM}}(p) = \mathrm{D\text{-}SSIM}\big(\hat{P}(p), P(p)\big)$, comparing the rendered patch $\hat{P}(p)$ around $p$ to the ground-truth patch $P(p)$
The global energy is a sum of weighted squared residuals:

$$E(x) = \sum_{p} \Big[ (1-\lambda)\, r_{\mathrm{L1}}(p)^2 + \lambda\, r_{\mathrm{SSIM}}(p)^2 \Big],$$

where $x \in \mathbb{R}^{59N}$ stacks all Gaussian parameters and $\lambda$ balances the two loss terms ($\lambda = 0.2$ in standard 3DGS). The residual vector $r(x)$ is defined entrywise as either $\sqrt{1-\lambda}\, r_{\mathrm{L1}}(p)$ or $\sqrt{\lambda}\, r_{\mathrm{SSIM}}(p)$, so that $E(x) = \|r(x)\|_2^2$. The goal is $x^{*} = \arg\min_x E(x)$.
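As a quick sanity check on this formulation, the following sketch (with made-up per-pixel residual values and the standard 3DGS weight $\lambda = 0.2$) confirms that entrywise square-root weighting of the residuals reproduces the weighted energy as a plain sum of squares, which is the form LM requires:

```python
import math

# Toy illustration, not the paper's implementation: assemble the stacked
# residual vector whose squared norm equals the weighted energy E(x).
lam = 0.2                      # SSIM weight, as in standard 3DGS
r_l1 = [0.10, -0.05, 0.02]     # per-pixel color residuals (assumed values)
r_ssim = [0.30, 0.20, 0.10]    # per-pixel D-SSIM residuals (assumed values)

# Entrywise scaling by sqrt of the weights turns the weighted sum of
# squares into an ordinary sum of squares.
residuals = ([math.sqrt(1 - lam) * v for v in r_l1] +
             [math.sqrt(lam) * v for v in r_ssim])

energy = sum(v * v for v in residuals)
direct = (sum((1 - lam) * v * v for v in r_l1) +
          sum(lam * v * v for v in r_ssim))
assert abs(energy - direct) < 1e-12
```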
LM proceeds via the Gauss–Newton step with Tikhonov-style damping. The update equation for the $k$-th iteration is:

$$\big( J_k^{\top} J_k + \lambda^{\mathrm{reg}} I \big)\, \Delta_k = -J_k^{\top} r(x_k), \qquad x_{k+1} = x_k + \Delta_k,$$

where $J_k = \partial r / \partial x \,\big|_{x_k}$ is the Jacobian of $r$ with respect to $x$, and $\lambda^{\mathrm{reg}}$ is a regularization coefficient adapted per iteration via the standard gain-ratio ($\rho$) test.
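The damped update and the $\rho$-test can be illustrated on a toy problem. The sketch below (synthetic data, a two-parameter exponential fit, and a simple halve/double damping schedule, none of which are from the paper) runs complete LM iterations with an accept/reject gain-ratio test:

```python
import math

# Minimal LM sketch on a toy curve fit r_i = a*exp(b*t_i) - y_i.
# Illustration only; the real solver is matrix-free over millions of
# parameters and never forms J^T J explicitly.
ts = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 2.7, 3.7, 5.0]          # synthetic observations (assumed)

def residuals(a, b):
    return [a * math.exp(b * t) - y for t, y in zip(ts, ys)]

def jacobian(a, b):
    # Rows: d r_i / d(a, b)
    return [[math.exp(b * t), a * t * math.exp(b * t)] for t in ts]

def energy(a, b):
    return sum(r * r for r in residuals(a, b))

def lm_step(a, b, lam_reg):
    r = residuals(a, b)
    J = jacobian(a, b)
    # Normal equations (J^T J + lam*I) delta = -J^T r; 2x2, solved by Cramer.
    JTJ = [[sum(J[i][p] * J[i][q] for i in range(len(r))) for q in range(2)]
           for p in range(2)]
    g = [sum(J[i][p] * r[i] for i in range(len(r))) for p in range(2)]
    A00, A11, A01 = JTJ[0][0] + lam_reg, JTJ[1][1] + lam_reg, JTJ[0][1]
    det = A00 * A11 - A01 * A01
    da = (-g[0] * A11 + g[1] * A01) / det
    db = (-g[1] * A00 + g[0] * A01) / det
    # Gain ratio rho: actual vs. model-predicted decrease of E.
    pred = -(g[0] * da + g[1] * db) + lam_reg * (da * da + db * db)
    rho = (energy(a, b) - energy(a + da, b + db)) / pred
    if rho > 0:                        # accept step, relax damping
        return a + da, b + db, lam_reg * 0.5
    return a, b, lam_reg * 2.0         # reject step, tighten damping

a, b, lam_reg = 1.0, 0.5, 1e-3
for _ in range(20):
    a, b, lam_reg = lm_step(a, b, lam_reg)
```

The predicted decrease uses the identity $J^{\top}J\Delta = -g - \lambda^{\mathrm{reg}}\Delta$ to avoid an extra matrix product, mirroring how the real solver evaluates the $\rho$-test cheaply.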
2. GPU-based Solver Architecture
The implementation is fully matrix-free, leveraging GPU parallelism and explicit CUDA kernels rather than materializing the massive sparse Jacobian matrix. All matrix computations are performed implicitly through custom operations over a “gradient cache”:
- buildCache: For each pixel, forward rasterization is followed by a per-pixel backward pass that stores all necessary pixel-to-splat gradients $\partial r_p / \partial G_s$ as flat cache entries, together with the pixel and splat indices that address the gradient payload.
- diagJTJ: Sums squared gradients per parameter to accumulate the diagonal elements of $J^{\top} J$.
- applyJ: Applies $J$ to a vector, streaming over the cache in pixel (ray) order.
- applyJT: Applies $J^{\top}$ to a vector, streaming over the cache sorted by splat (Gaussian) id.
Cache construction ensures efficient computation and memory coalescing, using sort operations to optimize data layout before each step.
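The cache-streaming pattern behind these kernels can be sketched in plain Python. All values are invented, and the cache is a simple list of `(pixel, param, gradient)` triples standing in for the flat GPU cache entries:

```python
# Matrix-free applyJ / applyJT / diagJTJ over a flat gradient cache.
# Each entry is (pixel index, parameter index, d r_pixel / d param);
# the gradient values below are made up for illustration.
cache = [
    (0, 0, 0.5), (0, 1, -0.2),
    (1, 0, 0.1), (1, 2, 0.7),
    (2, 1, 0.3), (2, 2, -0.4),
]
n_pixels, n_params = 3, 3

def apply_J(v):
    # (J v)_p = sum over cache entries of pixel p: grad * v[param]
    out = [0.0] * n_pixels
    for pixel, param, grad in cache:          # stream in pixel order
        out[pixel] += grad * v[param]
    return out

def apply_JT(u):
    # (J^T u)_j = sum over cache entries of param j: grad * u[pixel].
    # Sorting by param mirrors the splat-ordered GPU streaming for
    # coalesced writes; it is not needed for correctness here.
    out = [0.0] * n_params
    for pixel, param, grad in sorted(cache, key=lambda e: e[1]):
        out[param] += grad * u[pixel]
    return out

def diag_JTJ():
    # diag(J^T J)_j = sum over cache entries of param j: grad^2
    out = [0.0] * n_params
    for _, param, grad in cache:
        out[param] += grad * grad
    return out
```

On the GPU, each loop body becomes a thread's work item and the accumulations become atomic adds into the output buffer.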
3. Batching, Partial Update Aggregation, and Memory Management
To scale to datasets where the aggregate cache size exceeds GPU memory, the images are divided into batches. For each batch $b$:
- Construct the gradient cache for the batch.
- Accumulate the batch-specific right-hand side $J_b^{\top} r_b$ and the diagonal preconditioner $\mathrm{diag}(J_b^{\top} J_b)$.
- Solve the regularized normal equations using Preconditioned Conjugate Gradient (PCG, see Section 4).
Partial update directions $\Delta_b$ are combined as a weighted mean across batches, with per-parameter weights $w_b$ reflecting each batch's constraint strength on each Gaussian:

$$\Delta = \frac{\sum_b w_b \odot \Delta_b}{\sum_b w_b},$$

with the division taken elementwise.
A line search is performed on a subset of images to choose a step size $\gamma$ for the update $x_{k+1} = x_k + \gamma\,\Delta$. The damping coefficient $\lambda^{\mathrm{reg}}$ is then adjusted by the gain-ratio ($\rho$) test.
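A minimal sketch of the weighted-mean combination, with invented per-batch update directions and weights standing in for the per-parameter constraint strengths:

```python
# Combine per-batch PCG solutions into one update via a per-parameter
# weighted mean. All numbers are illustrative, not from the paper.
batch_deltas = [[0.10, -0.30], [0.20, -0.10]]   # per-batch update directions
batch_weights = [[4.0, 1.0], [1.0, 3.0]]        # per-batch constraint weights

n_params = 2
delta = []
for j in range(n_params):
    num = sum(w[j] * d[j] for d, w in zip(batch_deltas, batch_weights))
    den = sum(w[j] for w in batch_weights)
    delta.append(num / den if den > 0 else 0.0)
# A parameter strongly constrained in one batch pulls the mean toward
# that batch's direction; weakly constrained batches contribute little.
```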
4. Preconditioned Conjugate Gradient (PCG) in CUDA
Each LM update solves the regularized normal system with PCG, requiring only implicit matrix–vector multiplies. With the Jacobi preconditioner $M = \mathrm{diag}(J^{\top} J) + \lambda^{\mathrm{reg}} I$, the workflow is as follows:
- Initialize the residual $r_0 = -J^{\top} r(x)$ (for $\Delta_0 = 0$), the preconditioned vector $z_0 = M^{-1} r_0$, and the search direction $p_0 = z_0$.
- For iterations $i = 0, 1, \dots$, compute:
- $u_i = J p_i$ via applyJ
- $q_i = J^{\top} u_i + \lambda^{\mathrm{reg}} p_i$ via applyJT plus the damping term
- $\alpha_i = (r_i^{\top} z_i) / (p_i^{\top} q_i)$, and updates to $\Delta_{i+1}$, $r_{i+1}$, $z_{i+1}$ (diagonal scaling by $M^{-1}$), and $p_{i+1}$ accordingly
All these steps are implemented as specialized GPU kernels, using the prebuilt gradient cache.
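The loop can be mirrored in a few lines of Python. This sketch defines the applyJ/applyJT callbacks from a small dense Jacobian purely for checking; the actual solver never materializes $J$:

```python
# Jacobi-preconditioned CG for (J^T J + lam*I) delta = -J^T r, using only
# applyJ / applyJT callbacks. Toy 3x2 system; values are illustrative.
J = [[2.0, 0.0], [1.0, 3.0], [0.0, 1.0]]
r = [1.0, -2.0, 0.5]
lam = 0.1

def apply_J(v):
    return [sum(Ji[k] * v[k] for k in range(2)) for Ji in J]

def apply_JT(u):
    return [sum(J[i][k] * u[i] for i in range(3)) for k in range(2)]

def pcg(iters=25):
    diag = [sum(J[i][k] ** 2 for i in range(3)) + lam for k in range(2)]
    b = [-g for g in apply_JT(r)]                 # right-hand side -J^T r
    delta = [0.0, 0.0]
    res = b[:]                                    # CG residual (delta0 = 0)
    z = [res[k] / diag[k] for k in range(2)]      # Jacobi preconditioning
    p = z[:]
    rz = sum(res[k] * z[k] for k in range(2))
    for _ in range(iters):
        q = apply_JT(apply_J(p))                  # J^T (J p)
        q = [q[k] + lam * p[k] for k in range(2)] # + lam * p  (damping)
        alpha = rz / sum(p[k] * q[k] for k in range(2))
        delta = [delta[k] + alpha * p[k] for k in range(2)]
        res = [res[k] - alpha * q[k] for k in range(2)]
        z = [res[k] / diag[k] for k in range(2)]
        rz_new = sum(res[k] * z[k] for k in range(2))
        if rz_new < 1e-30:                        # converged
            break
        p = [z[k] + (rz_new / rz) * p[k] for k in range(2)]
        rz = rz_new
    return delta

delta = pcg()
```

Each of `apply_J`, `apply_JT`, and the diagonal scaling corresponds to one kernel launch over the gradient cache in the GPU implementation.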
5. Computational Complexity and Efficiency
The dominant cost per LM iteration is in the PCG loop:
- Rasterization + cache build: $O(|B| \cdot H W \cdot \bar{K})$ for a batch $B$ of $H \times W$ images, with $\bar{K}$ average splats per pixel.
- PCG matvecs: each iteration streams once over the cache, costing $O(|B| \cdot H W \cdot \bar{K})$, with three such passes per batch.
- Batch combination: $O(P)$ to aggregate updates over all $P = 59N$ parameters.
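A back-of-envelope calculation (all numbers assumed for illustration, not taken from the paper) shows why the cache must be built per batch rather than for the whole dataset:

```python
# Rough cache-size estimate under assumed values: images per batch,
# resolution, mean splats per pixel, and bytes per cache entry.
images_per_batch = 25          # assumed
height, width = 1080, 1920     # assumed resolution
mean_splats_per_pixel = 32     # assumed K-bar
bytes_per_entry = 64           # assumed gradient payload + indices

entries = images_per_batch * height * width * mean_splats_per_pixel
gib = entries * bytes_per_entry / 2**30
# Tens of GiB per batch: this motivates both batching and the compact,
# sorted cache layout described above.
```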
In practice, only a small number of PCG iterations per batch and a few LM iterations beyond an initial ADAM warmup are required, whereas ADAM alone needs tens of thousands of iterations. This yields approximately 30% faster end-to-end optimization over the original ADAM-only 3DGS pipeline (Höllein et al., 2024).
6. Empirical Benchmarks and Output Quality
On an NVIDIA A100 GPU (Tanks & Temples “Train” scene):
- 3DGS + ADAM: 736 s, SSIM 0.844, PSNR 23.68, LPIPS 0.178
- 3DGS + LM: 663 s (about 10% faster), matching SSIM, PSNR, and LPIPS
Combining LM with additional system-level speedups (DISTWAR, gsplat, Taming-3DGS) maintains a 30% total speed increase relative to vanilla 3DGS. Output quality remains unchanged according to standard perceptual quality metrics.
7. Key Equations and GPU Code Constructs
The core update equation is the regularized normal equation:

$$\big( J^{\top} J + \lambda^{\mathrm{reg}} I \big)\, \Delta = -J^{\top} r.$$
PCG matrix–vector steps are realized as:
- applyJ: for each Gaussian $g$, stream its cache entries (located via gaussianParamIndex) and accumulate $(J v)_p \mathrel{+}= (\partial r_p / \partial x_g)\, v_g$
- applyJT: for each ray (pixel) $p$, stream its cache entries (located via pixelIndex) and accumulate $(J^{\top} u)_g \mathrel{+}= (\partial r_p / \partial x_g)\, u_p$
- diagJTJ: for each parameter $j$, accumulate $\mathrm{diag}(J^{\top} J)_j = \sum_p (\partial r_p / \partial x_j)^2$
CUDA pseudocode for per-pixel cache build:
```
// buildCache kernel (one thread per pixel):
for each pixel thread t:
    trace splats s1 ... sK along the ray of pixel t
    compute the alpha-blend state up to splat sk
    compute dr/dsk, the residual gradient w.r.t. splat sk
    write {pixel = t, splat = sk, grad = dr/dsk} to the cache
// sortCache: Thrust/radix sort of entries by splat -> Gaussian id
```
An efficient implementation follows this workflow for matrix-free LM updates, exploiting full GPU parallelism through explicit kernel launches and gradient caching.
For further details on implementation specifics and the complete pipeline, see (Höllein et al., 2024).