SGAD-SLAM: Depth-Adjusted RGB-D SLAM
- The paper introduces a novel formulation using per-pixel, depth-adjustable 3D Gaussians to balance flexibility, accuracy, and efficiency in SLAM applications.
- It employs an end-to-end differentiable rendering pipeline with a mapping loss that integrates RGB fidelity, SSIM, and depth consistency for robust optimization.
- Selective Gaussian storage across current and neighbor frames leads to competitive tracking accuracy and rapid mapping while reducing memory demands.
SGAD-SLAM is an RGB-D SLAM (Simultaneous Localization and Mapping) system that introduces a novel formulation for scene representation and tracking based on pixel-aligned 3D Gaussian splatting with learnable depth adjustments. In contrast to traditional approaches that represent radiance fields with either unconstrained 3D Gaussians or over-constrained view-tied 3D Gaussians, SGAD-SLAM proposes per-pixel, depth-adjustable Gaussians to balance flexibility, accuracy, and computational efficiency in radiance field mapping and camera tracking (Hu et al., 22 Mar 2026).
1. System Architecture and Radiance Field Representation
SGAD-SLAM represents each RGB-D camera frame as a set of pixel-aligned 3D Gaussians. For pixel $p$ in frame $i$, a Gaussian $g_p^i$ is defined with the following attributes:
- $\mathbf{c}_p$: per-pixel color (RGB radiance)
- $\sigma_p$: standard deviation ("radius")
- $o_p$: opacity (peak density)
- $\Delta d_p$: learnable depth offset along the viewing ray
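The Gaussian center in world coordinates is
$$\boldsymbol{\mu}_p = \mathbf{t} + \lvert d_p + \Delta d_p \rvert \, \mathbf{r}_p,$$
where $\mathbf{t}$ is the camera translation, $d_p$ the observed depth at pixel $p$, $\Delta d_p$ the learnable depth offset, and $\mathbf{r}_p$ the world-space ray direction. The absolute value ensures center positions remain physically valid, that is, in front of the camera.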
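A minimal sketch of this center computation follows. It assumes the center is the camera translation plus `|d + delta_d|` times the unit ray direction, and that depth is measured along the ray; the function and parameter names are illustrative, not taken from the paper.

```python
import math


def gaussian_center(cam_t, ray_dir, depth, depth_offset):
    """Center of a pixel-aligned Gaussian on its viewing ray.

    Assumed form: center = t + |d + delta_d| * r, with r normalized here.
    """
    norm = math.sqrt(sum(c * c for c in ray_dir))
    unit = [c / norm for c in ray_dir]
    # The absolute value keeps the center in front of the camera even if
    # the optimizer pushes the offset below -depth.
    dist = abs(depth + depth_offset)
    return [t + dist * u for t, u in zip(cam_t, unit)]


center = gaussian_center(
    cam_t=[0.0, 0.0, 1.0],
    ray_dir=[0.0, 0.0, 2.0],  # not unit length on purpose
    depth=1.5,
    depth_offset=-0.25,
)
```

Because only the scalar offset is learnable, each Gaussian can slide along its own pixel ray but cannot drift sideways, which is the constraint the formulation relies on.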
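Together, these Gaussians define two continuous fields over the scene: a density field, governed by the opacities and radii, and a view-independent radiance field, governed by the per-pixel colors.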
Only the Gaussians of the current frame and a small window of neighbor frames are stored and optimized, which restricts GPU memory usage and computation.
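The bounded storage described above can be sketched with a fixed-length queue. This assumes the neighbor frames are simply the most recent ones; the source does not say how neighbors are selected, and the class and method names are hypothetical.

```python
from collections import deque


class GaussianWindow:
    """Holds Gaussians for the current frame and its most recent neighbors."""

    def __init__(self, num_neighbors):
        # maxlen = neighbors + current frame; the oldest frame is evicted
        # automatically when a new one is pushed.
        self._frames = deque(maxlen=num_neighbors + 1)

    def push(self, frame_id, gaussians):
        self._frames.append((frame_id, gaussians))

    def frame_ids(self):
        return [fid for fid, _ in self._frames]

    def num_gaussians(self):
        return sum(len(g) for _, g in self._frames)


window = GaussianWindow(num_neighbors=2)
for fid in range(5):
    # Four dummy Gaussians per frame stand in for one Gaussian per pixel.
    window.push(fid, [{"depth_offset": 0.0} for _ in range(4)])
```

With this layout the number of Gaussians being optimized depends on the window size and image resolution, not on the length of the sequence.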
2. Depth Adjustment Mechanism and Mapping Optimization
The pixel-aligned Gaussians in SGAD-SLAM include a learnable depth offset $\Delta d$. During mapping, these offsets are optimized to adjust the center positions along each pixel’s viewing ray, allowing precise radiance field reconstruction without unconstrained 3D Gaussian motions.
The mapping loss combines three terms: a color term between the rendered image $\hat{I}$ and the ground-truth RGB image $I$, a structural similarity (SSIM) term, and a depth term between the rendered depth $\hat{D}$ and the observed depth $D$, with a mask $M$ excluding pixels whose depth is invalid. The differentiable splatting operator generates the rendered images for optimization, and gradients for the Gaussian parameters, including the depth offsets, propagate through the rendering pipeline by the chain rule.
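The structure of such a loss can be sketched as below. This is an illustration under stated assumptions rather than the paper's definition: it uses L1 norms, a single global SSIM window instead of the usual local windows, treats observed depth of zero as invalid, and the three weights are placeholders.

```python
def mean(xs):
    return sum(xs) / len(xs)


def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over one global window (a simplification of windowed SSIM)."""
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def mapping_loss(rgb_hat, rgb, depth_hat, depth,
                 w_rgb=0.8, w_ssim=0.2, w_depth=1.0):
    """Weighted sum of L1 color, (1 - SSIM), and masked L1 depth terms.

    All inputs are flat lists of floats; the weights are illustrative.
    """
    l_rgb = mean([abs(a - b) for a, b in zip(rgb_hat, rgb)])
    l_ssim = 1.0 - global_ssim(rgb_hat, rgb)
    # Mask: keep only pixels whose observed depth is valid (> 0 here).
    valid = [(a, b) for a, b in zip(depth_hat, depth) if b > 0.0]
    l_depth = mean([abs(a - b) for a, b in valid]) if valid else 0.0
    return w_rgb * l_rgb + w_ssim * l_ssim + w_depth * l_depth
```

In a real system the same terms would be computed on GPU tensors by an automatic-differentiation framework so that gradients reach the depth offsets.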
3. Rendering Pipeline
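SGAD-SLAM employs volumetric ray splatting for rendering. For a camera ray $\mathbf{r}$, the color is accumulated front to back over the Gaussians the ray intersects, in the standard volume-rendering form:
$$\hat{C}(\mathbf{r}) = \sum_{k} T_k \left(1 - e^{-\rho_k \delta_k}\right) \mathbf{c}_k, \qquad T_k = \exp\Big(-\sum_{j<k} \rho_j \delta_j\Big),$$
where $T_k$ is the accumulated transmittance, $\rho_k$ the density, $\mathbf{c}_k$ the color, and $\delta_k$ the "thickness" of the $k$-th Gaussian along the ray. The depth image is rendered as the expected depth along the ray, using the same transmittance weighting.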
This enables end-to-end differentiable rendering necessary for the joint optimization of radiance and depth parameters.
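A minimal sketch of transmittance-weighted compositing along one ray follows. It assumes the standard volume-rendering weights, `T * (1 - exp(-density * thickness))`, and a near-to-far ordering of the samples; the tuple layout and names are illustrative.

```python
import math


def composite_ray(samples):
    """Front-to-back compositing of the Gaussians hit by one ray.

    samples: list of (density, thickness, color, depth), sorted near to far.
    Returns (rgb, expected_depth, remaining_transmittance).
    """
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed
    for density, thickness, color, z in samples:
        alpha = 1.0 - math.exp(-density * thickness)
        weight = transmittance * alpha
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        depth += weight * z  # same weights as color
        transmittance *= 1.0 - alpha
    return rgb, depth, transmittance


rgb, depth, remaining = composite_ray([
    (0.5, 1.0, (1.0, 0.0, 0.0), 2.0),  # semi-transparent red Gaussian
    (2.0, 1.0, (0.0, 1.0, 0.0), 3.0),  # denser green Gaussian behind it
])
```

Every operation here is differentiable in density, color, and depth, which is what allows the mapping loss to update the Gaussian parameters.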
4. Frame-to-Map Tracking via Gaussian Geometry and GICP
For each frame, geometry extraction and tracking proceed as follows:
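- Uniformly downsampled depth pixels are back-projected to 3D positions $\mathbf{p}_j$.
- Local geometry about each point is modeled using the covariance $\Sigma_j$ of its $k$ (typically 10) nearest 3D points, forming a geometry Gaussian $\mathcal{N}(\mathbf{p}_j, \Sigma_j)$.
- A global map $\mathcal{M}$ accumulates non-overlapping geometry Gaussians from previous frames.
- Frame alignment is posed as a Generalized ICP optimization over matched pairs of frame Gaussians $(\mathbf{p}_j, \Sigma_j)$ and map Gaussians $(\mathbf{q}_j, \Sigma_j^{\mathcal{M}})$:
$$\mathbf{T}^{*} = \arg\min_{\mathbf{T}} \sum_j \mathbf{e}_j^{\top} \left(\Sigma_j^{\mathcal{M}} + \mathbf{R}\,\Sigma_j\,\mathbf{R}^{\top}\right)^{-1} \mathbf{e}_j, \qquad \mathbf{e}_j = \mathbf{q}_j - (\mathbf{R}\,\mathbf{p}_j + \mathbf{t})$$
This is minimized with respect to the rigid transformation $\mathbf{T} = (\mathbf{R}, \mathbf{t})$ of the current frame pose, using alternating correspondence assignment and closed-form updates.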
Residuals are also formulated for point-to-plane minimization within the Levenberg–Marquardt scheme.
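The two building blocks of this tracking step can be sketched in plain Python: a local covariance from nearest neighbors, and the standard Generalized ICP cost for a given pose. This is an illustration, not the paper's implementation; it uses brute-force neighbor search, assumes correspondences are already assigned, and all function names are hypothetical.

```python
def knn_covariance(points, index, k=10):
    """Mean and 3x3 covariance of the k nearest neighbors of points[index]."""
    p = points[index]
    by_dist = sorted(points, key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
    nbrs = by_dist[:k]  # includes the query point itself (distance 0)
    mu = [sum(q[d] for q in nbrs) / len(nbrs) for d in range(3)]
    cov = [[sum((q[r] - mu[r]) * (q[c] - mu[c]) for q in nbrs) / len(nbrs)
            for c in range(3)] for r in range(3)]
    return mu, cov


def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def mat_mul(a, b):
    return [[sum(a[r][i] * b[i][c] for i in range(3)) for c in range(3)]
            for r in range(3)]


def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]


def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate (assumes it is non-singular)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]


def gicp_cost(pairs, rot, trans):
    """Sum of e^T (C_q + R C_p R^T)^-1 e over matched Gaussians.

    pairs: list of (p, cov_p, q, cov_q); p is from the frame, q from the map.
    """
    total = 0.0
    for p, cov_p, q, cov_q in pairs:
        moved = [a + b for a, b in zip(mat_vec(rot, p), trans)]
        err = [b - a for a, b in zip(moved, q)]
        rot_cov = mat_mul(mat_mul(rot, cov_p), transpose(rot))
        combined = [[cov_q[r][c] + rot_cov[r][c] for c in range(3)]
                    for r in range(3)]
        total += sum(x * y for x, y in zip(err, mat_vec(inverse3(combined), err)))
    return total
```

Covariances estimated from nearly planar neighborhoods are close to singular, so practical GICP implementations usually regularize them before inversion; that step is omitted here for brevity.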
5. System Design Choices and Scalability
SGAD-SLAM leverages several design principles to enhance scalability:
- Spherical Gaussians (a single radius parameter each) are used, avoiding the larger parameter count of ellipsoidal Gaussians.
- Only current and nearest-neighbor frame Gaussians are actively stored and optimized; no global dense map is held in GPU memory.
- Radii are scale-normalized across frames for robust field matching.
- The exclusion of local densification and rotation attributes increases computational throughput.
The result is a system with significantly lower memory and computation demands compared to methods using scene-wide Gaussian fields.
| Gaussian Attribute | SGAD-SLAM (parameters) | Full Ellipsoid (parameters) |
|---|---|---|
| Position | 0 | 3 |
| Radius / Scale | 1 | 3 |
| Opacity | 1 | 1 |
| Color | 3 | 3 |
| Depth Offset | 1 | 0 |
| Rotation (quaternion) | 0 | 4 |
| Total | 6 | 14 |
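To make the difference in totals concrete, the following back-of-the-envelope sketch estimates per-frame parameter storage at 480 × 640 resolution. It assumes 32-bit floats and counts only the Gaussian parameters themselves, not optimizer state or gradients.

```python
def gaussian_bytes(num_gaussians, params_per_gaussian, bytes_per_param=4):
    """Raw parameter storage, assuming float32 values by default."""
    return num_gaussians * params_per_gaussian * bytes_per_param


pixels = 480 * 640  # one Gaussian per pixel
sgad_mb = gaussian_bytes(pixels, 6) / 1e6    # 6 parameters per Gaussian
full_mb = gaussian_bytes(pixels, 14) / 1e6   # 14 parameters per Gaussian
print(f"SGAD-SLAM: {sgad_mb:.2f} MB, full ellipsoid: {full_mb:.2f} MB")
```

Under these assumptions the pixel-aligned parameterization needs 6/14, or roughly 43%, of the storage of a full ellipsoidal one for the same number of Gaussians.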
6. Experimental Evaluation
SGAD-SLAM has been evaluated on the Replica, TUM-RGBD, ScanNet, and ScanNet++ datasets using standard metrics: Absolute Trajectory Error (ATE RMSE, cm) for tracking and PSNR, SSIM, LPIPS for radiance field accuracy; Depth L1 and F1 for mesh reconstruction.
On Replica (8 scenes): ATE RMSE was 0.16 cm, matching GS-ICP and improving on NICE-SLAM (1.95 cm) and VTGS-SLAM (0.28 cm). Mapping quality: PSNR 44.87 dB, SSIM 0.998, LPIPS 0.021.
On TUM-RGBD (3 scenes): ATE 2.0 cm; PSNR 38.60 dB, SSIM 0.997, outperforming VTGS on radiance quality. On ScanNet: ATE 7.9 cm, PSNR 42.31 dB, SSIM 0.997.
In terms of efficiency on Replica, tracking takes 0.01 s per frame and mapping 0.89 s, for 0.90 s overall, compared with NICE-SLAM (2.21 s), GS-ICP (1.05 s), and SplaTAM (7.59 s). Scene-wide, 326 million Gaussians are used, with 0.816 million learnable Gaussians per frame.
7. Implementation and Optimization Details
Key configuration parameters include:
- Gaussians per frame: one per pixel, $H \times W$ (e.g., $480 \times 640 \approx 307$K), downsampled for tracking (480K).
- Number of nearest neighbors for covariance estimation: $k$ (typically 10); Gaussians are kept for a small window of neighbor frames.
- Mapping: Adam optimizer with a learning rate that is decreased after 100 iterations; 200 mapping iterations per frame.
- Tracking: 10 iterations per frame.
- Hardware: NVIDIA A100; mapping one scene takes 9–12 min using 8 GPUs; the frame rate is 1–1.1 Hz on a single GPU and 2–6 Hz on 8 GPUs.
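The main workflow alternates tracking and mapping for each incoming frame: the frame is first aligned to the global geometry map with GICP, and its pixel-aligned Gaussians are then initialized and optimized together with those of its neighbor frames.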
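The per-frame workflow can be sketched as the loop below. It is a structural outline under assumptions, not the paper's pseudocode: tracking is assumed to run before mapping, neighbors are assumed to be the most recent frames, and `track`, `init_gaussians`, and `optimize` are caller-supplied placeholders for GICP tracking, per-pixel Gaussian initialization, and the mapping optimization.

```python
from collections import deque


def run_slam(frames, track, init_gaussians, optimize, num_neighbors=2):
    """Per-frame loop: track the pose, then map over a sliding window."""
    window = deque(maxlen=num_neighbors + 1)  # current frame plus neighbors
    poses = []
    pose = None  # no prior pose before the first frame
    for frame in frames:
        pose = track(frame, pose)                    # e.g. GICP against the map
        window.append(init_gaussians(frame, pose))   # one Gaussian per pixel
        optimize(list(window))                       # refine the stored Gaussians
        poses.append(pose)
    return poses


# Usage with trivial stand-ins for the three stages.
calls = []
poses = run_slam(
    frames=["f0", "f1", "f2", "f3"],
    track=lambda frame, prev: 0 if prev is None else prev + 1,
    init_gaussians=lambda frame, pose: {"frame": frame, "pose": pose},
    optimize=lambda win: calls.append([g["frame"] for g in win]),
)
```

Passing the three stages in as callables keeps the loop independent of any particular renderer or optimizer, which makes the control flow easy to test in isolation.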
8. Context and Implications
SGAD-SLAM’s adoption of depth-adjustable, pixel-aligned 3D Gaussians introduces an effective trade-off between geometric flexibility (for high-quality radiance field estimation) and computational tractability (scalability, memory usage, and speed). The empirical results demonstrate competitive or superior accuracy in both tracking and mapping compared to previous SLAM systems employing deep implicit or explicit geometric models, such as NICE-SLAM, GS-ICP, VTGS-SLAM, and SplaTAM. Because only the current and neighbor frames are held in GPU memory, the architecture is suited to large-scale scenes; the reported timings were measured on NVIDIA A100 GPUs.
Potential implications include adaptation to relocalization, loop closure, or semantic mapping tasks, as well as extension to higher-level scene understanding, as the underlying representation naturally supports dense, differentiable, and photometrically meaningful radiance field construction (Hu et al., 22 Mar 2026).