SplaTAM: Volumetric 3D Gaussian SLAM
- SplaTAM is an explicit volumetric SLAM system that employs weighted 3D Gaussian ellipsoids to achieve dense RGB-D reconstruction and efficient online tracking.
- It leverages a differentiable rendering pipeline with joint photometric and geometric residuals, using analytic Jacobians to optimize camera pose and refine map parameters.
- The system offers real-time performance, reduced memory usage, and improved reconstruction accuracy relative to volumetric grids and NeRF-based SLAM methods.
SplaTAM is an explicit volumetric SLAM (Simultaneous Localization and Mapping) system that utilizes weighted 3D Gaussian ellipsoids as map primitives for dense RGB-D reconstruction and online tracking. By integrating 3D Gaussian splatting into SLAM, it enables fast, high-fidelity scene modeling from a single unposed RGB-D camera. SplaTAM combines a differentiable rendering pipeline, online nonlinear camera tracking, and a memory-efficient map update strategy. Major motivations for SplaTAM include surpassing the fidelity, efficiency, and generality of previous dense SLAM paradigms such as volumetric grids, point clouds, and implicit neural fields.
1. Map Representation with Weighted 3D Gaussians
The central data structure in SplaTAM is a map whose elements are 3D Gaussians, each defined by:
- Mean position $\mu_i \in \mathbb{R}^3$
- Covariance $\Sigma_i$ (symmetric, positive definite)
- Color or radiance $c_i$
- Opacity $\alpha_i \in [0, 1]$
The volumetric density contributed by Gaussian $i$ at a point $x \in \mathbb{R}^3$ is modeled as $f_i(x) = \alpha_i \exp\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1} (x-\mu_i)\right)$. Summing over all Gaussians, the global density is $f(x) = \sum_i f_i(x)$.
This continuous, differentiable scene model enables analytic computation of both photometric renderings and spatial gradients needed for optimization.
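As a concrete illustration, the following is a minimal NumPy sketch of such a map and its density evaluation. The class layout and names are our own illustrative choices, not the paper's API:

```python
import numpy as np

class GaussianMap:
    """Illustrative container for the map: one row per 3D Gaussian."""

    def __init__(self, means, covs, colors, opacities):
        self.means = means          # (N, 3) mean positions mu_i
        self.covs = covs            # (N, 3, 3) covariances Sigma_i (SPD)
        self.colors = colors        # (N, 3) colors c_i
        self.opacities = opacities  # (N,) opacities alpha_i in [0, 1]

    def density(self, x):
        """Global density f(x) = sum_i alpha_i * exp(-0.5 * d_i^T Sigma_i^{-1} d_i)."""
        d = x[None, :] - self.means                               # (N, 3) offsets
        sol = np.linalg.solve(self.covs, d[:, :, None])[:, :, 0]  # Sigma_i^{-1} d_i
        mahal = np.sum(d * sol, axis=1)                           # squared Mahalanobis
        return float(np.sum(self.opacities * np.exp(-0.5 * mahal)))
```

Because `density` is built from differentiable primitives, the same computation expressed in an autodiff framework yields the spatial gradients mentioned above.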
2. Volumetric Splatting and Differentiable Rendering
To project the 3D Gaussian model onto the camera image:
- Each Gaussian is transformed to camera coordinates using the current pose $T_t \in \mathrm{SE}(3)$ with rotation $R_t$ and translation $t_t$: $\mu_i^{\mathrm{cam}} = R_t \mu_i + t_t$.
- The mean is projected by the pinhole projection $\pi$ to the image: $u_i = \pi(\mu_i^{\mathrm{cam}})$.
- The covariance translates to a 2D elliptical footprint via $\Sigma_i' = J R_t \Sigma_i R_t^\top J^\top$, where $J$ is the projection Jacobian evaluated at $\mu_i^{\mathrm{cam}}$ (a sketch of this projection follows below).
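The projection of a single Gaussian can be sketched as follows under a pinhole model; the function name, the world-to-camera convention, and the intrinsics `fx`, `fy` are illustrative assumptions rather than the paper's exact interface:

```python
import numpy as np

def project_gaussian(mu_w, Sigma_w, R, t, fx, fy):
    """Project a world-space Gaussian to its 2D image footprint (EWA-style).

    R, t: world-to-camera rotation and translation of the current pose T_t.
    Returns the projected mean (offset from the principal point) and the
    2x2 footprint covariance Sigma'.
    """
    mu_c = R @ mu_w + t                     # transform mean to camera coordinates
    x, y, z = mu_c
    u = np.array([fx * x / z, fy * y / z])  # perspective projection of the mean

    # Jacobian J of (u, v) = (fx*x/z, fy*y/z), evaluated at mu_c
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])

    Sigma_2d = J @ R @ Sigma_w @ R.T @ J.T  # 2D elliptical footprint
    return u, Sigma_2d
```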
Rendering proceeds by per-pixel accumulation with front-to-back alpha compositing: $C(p) = \sum_i c_i \, \tilde{\alpha}_i(p) \prod_{j<i} \left(1 - \tilde{\alpha}_j(p)\right)$, where $\tilde{\alpha}_i(p)$ is the opacity of Gaussian $i$ evaluated at pixel $p$ through its 2D footprint and Gaussians are sorted front to back. The accumulated weight defines a silhouette $S(p) = \sum_i \tilde{\alpha}_i(p) \prod_{j<i} \left(1 - \tilde{\alpha}_j(p)\right)$; a silhouette mask $M(p) = 1$ if $S(p) > \tau$ (and 0 otherwise) serves to identify newly observed or unmapped pixels.
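A minimal sketch of this compositing at a single pixel, assuming the overlapping Gaussians are already sorted front to back and their per-pixel opacities precomputed:

```python
import numpy as np

def composite_pixel(colors, alphas, tau=0.99):
    """Front-to-back alpha compositing at one pixel.

    colors: (N, 3) colors of the depth-sorted Gaussians covering the pixel.
    alphas: (N,) opacities evaluated at the pixel from each 2D footprint.
    Returns rendered color C(p), silhouette S(p), and mask M(p) = [S > tau].
    """
    C = np.zeros(3)
    transmittance = 1.0                # running product prod_{j<i} (1 - alpha_j)
    for c, a in zip(colors, alphas):
        C += c * a * transmittance
        transmittance *= (1.0 - a)
    S = 1.0 - transmittance            # accumulated silhouette weight
    return C, S, S > tau
```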
This rendering strategy supports analytic derivatives with respect to both camera pose and all Gaussian parameters, which is critical for joint optimization within SLAM.
3. Online Tracking with Joint Photometric and Geometric Residuals
Camera pose refinement at each step solves the following optimization:
- For each RGB-D input frame $(I_t, D_t)$, residuals are evaluated over the visible domain $\Omega_{\mathrm{vis}}$ (pixels where the silhouette mask is 1):
- Photometric: $r^c_p = I_t(p) - C(p)$
- Geometric: $r^d_p = D_t(p) - \hat{D}(p)$, where $\hat{D}(p)$ is the rendered depth
- Aggregate loss: $L(T_t) = \sum_{p \in \Omega_{\mathrm{vis}}} \left( \lVert r^c_p \rVert^2 + \lambda \lVert r^d_p \rVert^2 \right)$, with $\lambda$ weighting the depth term. Optimization proceeds via Gauss–Newton or Levenberg–Marquardt, leveraging analytic Jacobians: the closed-form Jacobians with respect to the SE(3) pose and the image projection yield highly effective camera localization directly from raw measurements (a schematic step is sketched below).
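A schematic Gauss–Newton step in NumPy is sketched below; `render_and_jacobian` is a hypothetical callback standing in for the differentiable renderer, and the small damping term is an illustrative Levenberg–Marquardt-style choice:

```python
import numpy as np

def se3_exp(xi):
    """Exponential map from a twist xi = (omega, v) to a 4x4 SE(3) matrix."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    if theta < 1e-8:                       # small-angle approximation
        R, V = np.eye(3) + K, np.eye(3) + 0.5 * K
    else:                                  # Rodrigues' formula
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1 - np.cos(theta)) / theta**2 * K @ K)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * K
             + (theta - np.sin(theta)) / theta**3 * K @ K)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def gauss_newton_step(T, render_and_jacobian):
    """One pose update: solve (J^T J) dxi = -J^T r, then retract onto SE(3).

    render_and_jacobian(T) is assumed to return the stacked (depth-weighted)
    residuals r of shape (M,) and their Jacobian J of shape (M, 6) with
    respect to the 6-DoF pose increment xi.
    """
    r, J = render_and_jacobian(T)
    H = J.T @ J                                       # Gauss-Newton Hessian approximation
    g = J.T @ r
    dxi = np.linalg.solve(H + 1e-6 * np.eye(6), -g)   # damped normal equations
    return se3_exp(dxi) @ T                           # left-multiplicative update
```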
4. Incremental Map Building: Densification, Refinement, Culling
The 3D Gaussian map is dynamically updated to reflect new observations:
- Densification: For pixels where $D_t(p) < \hat{D}(p) - \tau_d$ (with $\tau_d$ a depth-error threshold, e.g., tied to the median depth error), a new Gaussian is initialized at the back-projection of $D_t(p)$ transformed into world coordinates by the current pose, with small isotropic covariance $\Sigma_0$, color from $I_t(p)$, and nominal opacity $\alpha_0$ (see the sketch after this list).
- Parameter Refinement: Periodically, all (or a sliding window of) Gaussian parameters are jointly optimized to minimize accumulated photometric error across recent frames, via analytic gradients.
- Culling and Splitting: Gaussians with opacity or accumulated rendering weight below a threshold are pruned, while large or highly anisotropic Gaussians may be split to improve reconstruction of thin structures.
Such adaptivity yields a compact representation, with only a few thousand Gaussians required versus millions of voxels or MLP weights in prior systems.
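A per-pixel sketch of the densification test and back-projection; the function, default thresholds, and the camera-to-world pose convention are illustrative assumptions:

```python
import numpy as np

def maybe_densify(u, v, depth_obs, depth_rendered, color_obs, T_cam_to_world,
                  fx, fy, cx, cy, tau_d=0.02, sigma0=0.01, alpha0=0.5):
    """Return parameters for a new Gaussian if the observed depth lies in
    front of the rendered surface by more than tau_d, else None."""
    if not (depth_obs > 0.0 and depth_obs < depth_rendered - tau_d):
        return None                                    # pixel already explained by the map
    z = depth_obs
    pt_cam = np.array([(u - cx) / fx * z,              # pinhole back-projection
                       (v - cy) / fy * z,
                       z])
    mu_new = T_cam_to_world[:3, :3] @ pt_cam + T_cam_to_world[:3, 3]
    Sigma0 = (sigma0 ** 2) * np.eye(3)                 # small isotropic covariance
    return mu_new, Sigma0, color_obs, alpha0
```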
5. Computational Performance and Experimental Results
SplaTAM demonstrates performance benchmarks that emphasize its efficiency and fidelity:
- Absolute Trajectory Error (ATE) on ScanNet++: $0.55$ cm, surpassing competing dense SLAM methods with substantially higher ATE.
- Novel-view synthesis PSNR: $28.1$ dB on training views and $23.99$ dB on novel views, improving depth and RGB accuracy over NeRF-based SLAM systems.
- Runtime: Real-time execution on a single GPU for joint tracking and rendering.
- Memory usage: A fraction of that of NeRF-based SLAM systems; SplaTAM stores only thousands of Gaussians.
Pseudocode for an update iteration is:
```
Input: RGB-D frame (I_t, D_t), previous map M, pose T_{t-1}

1. Initialize T_t ← T_{t-1}
2. for iter = 1..N_track_iters do
       render C(p), D̂(p), and mask M(p) via splatting with T_t
       compute residuals r^c_p, r^d_p over Ω_vis
       build JᵀJ, Jᵀr and solve for Δξ
       update T_t ← exp(Δξ) · T_t
   end for
3. Densification: for each pixel p with valid D_t(p) do
       if D_t(p) < D̂(p) − τ_d then
           add a Gaussian with mean μ_new, covariance Σ_0, color c_new, opacity α_0
   end for
4. if t mod K == 0 then
       optimize map parameters {μ_i, Σ_i, c_i, α_i} over the last W frames
5. Cull low-weight Gaussians; optionally split large ones

Output: pose T_t, updated map M
```
6. Comparative Advantages and Relationship to Prior Work
SplaTAM advances dense SLAM through:
- Continuous, closed-form differentiable mapping: Allows direct analytic computation of pose and map parameter updates, in contrast to sampled point-based methods or implicit MLP fields.
- Efficient alpha-composited splatting: Enables real-time rendering and optimization at full image resolution.
- Adaptive, event-driven map allocation: New Gaussians are instantiated only for previously unmapped or newly observed regions, extending mapping coverage dynamically without superfluous resource use.
Compared to 2D Gaussian surfel approaches (e.g., LAM (Fan et al., 28 Jul 2025)), which model explicit surface orientation and radial Jacobians, SplaTAM's explicit volumetric density supports robust tracking but may lack the rotational cues that surfel normals provide. SplaTAM exceeds point-based and NeRF-based SLAM systems in both compactness and photometric/geometric accuracy on standard benchmarks.
This suggests that explicit, volumetric, differentiable 3D Gaussian representation—coupled with online analytic pose and map updates—constitutes a state-of-the-art foundation for dense, real-time RGB-D SLAM from a single camera.