- The paper introduces a novel dense RGB-D SLAM approach using 2D Gaussian surfels to enhance camera localization and 3D scene reconstruction.
- It details a surface-aware depth rendering method that employs unbiased depth computation, depth adjustment, and opacity normalization to mitigate geometric distortions.
- The system’s integrated local mapping and backend optimization achieve state-of-the-art tracking accuracy on benchmarks like Replica and ScanNet++ with millimeter precision.
GauS-SLAM (2505.01934) is a dense RGB-D SLAM system designed to achieve high-precision camera localization and high-fidelity 3D scene reconstruction. It differentiates itself by leveraging 2D Gaussian surfels as the primary scene representation, addressing key limitations found in prior Gaussian-based SLAM methods that primarily rely on 3D Gaussian splatting.
The paper identifies two main challenges in existing Gaussian-based tracking approaches:
- Geometry Distortion: Gaussian representations, particularly 3D Gaussians using a center depth model, suffer from multi-view inconsistent depth estimations. Interference between surfaces during depth blending further degrades geometry accuracy under novel viewpoints, which negatively impacts tracking accuracy based on frame-to-model alignment.
- Outlier Handling: Standard outlier rejection techniques based on accumulated opacity are insufficient to handle interference regions (e.g., occluded parts) that can still have high opacity, leading to misalignment during tracking.
To address these challenges, GauS-SLAM proposes several technical contributions:
1. 2D Gaussian Surfel Representation:
Unlike the 3D Gaussians of prior methods (often isotropic), 2D Gaussian surfels are designed to better model surfaces by being attached to tangent planes. A 2D Gaussian primitive is defined by its central point $\mu_i$, opacity $o_i$, color $c_i$, and a geometry matrix $\Sigma_i$ representing the 2D plane spanned by two tangential vectors $e_u, e_v$ and a normal $n$ (derived from the ground-truth or rendered depth). This representation allows an unbiased depth to be computed by directly intersecting a ray with the 2D Gaussian primitive's plane, offering better geometry accuracy than the center-depth model of 3D Gaussians.
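As a rough illustration of the unbiased depth, the sketch below intersects a camera ray with a surfel's tangent plane; the function and the NumPy setup are illustrative, not the paper's implementation.

```python
import numpy as np

def surfel_intersection_depth(ray_origin, ray_dir, mu, e_u, e_v, eps=1e-8):
    """Unbiased depth of a 2D Gaussian surfel along a ray.

    The surfel spans the tangent plane through its center `mu` with normal
    n = e_u x e_v; the depth is the ray/plane intersection distance rather
    than the distance to the Gaussian center.
    """
    n = np.cross(e_u, e_v)
    denom = np.dot(n, ray_dir)
    if abs(denom) < eps:          # ray (nearly) parallel to the surfel plane
        return None
    t = np.dot(n, mu - ray_origin) / denom
    return t if t > 0 else None   # keep only intersections in front of the camera

# Example: a fronto-parallel surfel one meter in front of the camera.
print(surfel_intersection_depth(
    ray_origin=np.zeros(3),
    ray_dir=np.array([0.0, 0.0, 1.0]),
    mu=np.array([0.0, 0.0, 1.0]),
    e_u=np.array([1.0, 0.0, 0.0]),
    e_v=np.array([0.0, 1.0, 0.0]),
))  # -> 1.0
```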
2. Surface-aware Depth Rendering:
The system introduces a refined depth rendering mechanism to improve geometry accuracy and multi-view consistency:
- Unbiased Depth: Leverages the inherent unbiased intersection depth of 2D Gaussian surfels.
- Depth Adjustment: Modifies the contribution of distant Gaussians during α-blending to mitigate interference. For a Gaussian $i$ along a ray, its depth $d_i$ is adjusted to $d_i'$ using a weighted blend with the median depth $d_m$ (the depth at which accumulated opacity exceeds 0.5): $d_i' = \beta_i d_i + (1-\beta_i) d_m$. The weight $\beta_i$ is a function of the distance between $d_i$ and $d_m$, decreasing for Gaussians far from the median depth: $\beta_i = \exp\left(-\frac{(d_i - d_m)^2}{B\,\sigma_i^2}\right)$.
- Depth Normalization: Normalizes the weighted depths by the total accumulated opacity $A(\mathbf{r})$ to prevent underestimation caused by incomplete opacity accumulation: $D(\mathbf{r}) = \frac{\sum_{i=1}^{n} d_i' w_i}{A(\mathbf{r})}$ (a single-ray sketch of this blending follows this list).
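A minimal single-ray sketch of this surface-aware blend, assuming the $\beta_i$ form above and treating the constant $B$ and the per-Gaussian scales $\sigma_i$ as given (the exact implementation is not specified in the text):

```python
import numpy as np

def surface_aware_depth(depths, alphas, sigmas, B=1.0):
    """Surface-aware depth blend for a single ray (illustrative sketch).

    depths : per-Gaussian unbiased intersection depths, sorted front-to-back
    alphas : per-Gaussian opacities after evaluating the 2D Gaussian kernel
    sigmas : per-Gaussian scales entering the adjustment weight (assumed)
    """
    depths, alphas, sigmas = map(np.asarray, (depths, alphas, sigmas))

    # Alpha-blending weights w_i = alpha_i * prod_{j<i} (1 - alpha_j).
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    w = alphas * transmittance
    A = w.sum()                                  # accumulated opacity A(r)

    # Median depth d_m: first Gaussian where accumulated opacity exceeds 0.5.
    m = np.searchsorted(np.cumsum(w), 0.5)
    d_m = depths[min(m, len(depths) - 1)]

    # Depth adjustment: pull Gaussians far from d_m toward the median depth.
    beta = np.exp(-((depths - d_m) ** 2) / (B * sigmas ** 2))
    d_adj = beta * depths + (1.0 - beta) * d_m

    # Depth normalization by the accumulated opacity.
    return (d_adj * w).sum() / max(A, 1e-8), A
```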
3. Camera Tracking:
GauS-SLAM employs a frame-to-model tracking approach within a local map. The camera pose $\{R, t\}$ for the current frame is optimized to minimize a combined loss comparing the rendered color and depth maps $(\hat{I}, \hat{D})$ with the observed ones $(I, D)$. The loss is applied only to pixels with high accumulated opacity ($A > 0.9$) to reduce the impact of newly observed or outlier regions. The optimization minimizes the discrepancy between the scene rendered from the current pose and the actual observation:

$\mathcal{L}_{track} = \mathbb{1}(A > 0.9)\left(\mathcal{L}_1(D,\hat{D}) + \lambda_1 \mathcal{L}_1(I,\hat{I})\right)$

Pose initialization uses a constant velocity model. The tracking optimization uses the Adam optimizer with parameters $\beta_1 = 0.7$, $\beta_2 = 0.99$.
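A hedged PyTorch sketch of this masked objective; `render_fn`, the $\lambda_1$ value, and the channel-last tensor layout are assumptions made for illustration.

```python
import torch

def tracking_loss(render_fn, pose, obs_rgb, obs_depth, lam1=0.5, opacity_thresh=0.9):
    """Frame-to-model tracking objective (illustrative sketch).

    `render_fn(pose)` is assumed to return (H, W, 3) color, (H, W) depth and
    (H, W) accumulated opacity rendered from the local Gaussian map.
    """
    rgb, depth, opacity = render_fn(pose)
    mask = (opacity > opacity_thresh).detach()   # drop newly observed / outlier pixels
    l1 = lambda a, b: (a - b).abs()[mask].mean()
    return l1(obs_depth, depth) + lam1 * l1(obs_rgb, rgb)

# The pose parameters would then be refined with Adam, e.g.
# torch.optim.Adam([pose_params], lr=1e-3, betas=(0.7, 0.99)).
```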
4. Incremental Mapping:
The system initializes new 2D Gaussian surfels using a "Surfel Attachment" strategy, inspired by SplaTAM (Keetha et al., 2024). New Gaussians are placed at the unprojected 3D points corresponding to pixels with low accumulated opacity ($A < 0.6$). Their initial scale and orientation are derived from the pixel's depth and estimated normal. For pixels with partial coverage ($0.4 < A < 0.6$), the rendered depth is used for initialization ("Edge Growth"), allowing the model to expand into partially reconstructed areas.
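The selection logic can be sketched as follows; the thresholds are those quoted above, while the function names, intrinsics handling, and NumPy layout are assumed.

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map to camera-space 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    return np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], axis=-1)

def surfel_attachment_depth(obs_depth, rendered_depth, opacity,
                            add_thresh=0.6, edge_low=0.4):
    """Choose which pixels spawn surfels and which depth to unproject them from.

    Pixels with accumulated opacity A < 0.6 receive new surfels; partially
    covered pixels (0.4 < A < 0.6) use the rendered depth ("Edge Growth").
    """
    add_mask = opacity < add_thresh
    edge_mask = (opacity > edge_low) & (opacity < add_thresh)
    depth_src = np.where(edge_mask, rendered_depth, obs_depth)
    return add_mask, depth_src  # new surfels go at unproject(depth_src, ...)[add_mask]
```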
Mapping optimizes the parameters of the Gaussian model within the local map using a loss function combining color, depth, and a regularization term:
$\mathcal{L}_{map} = \mathcal{L}_1(D,\hat{D}) + \lambda_1 \mathcal{L}_1(I,\hat{I}) + \lambda_2 \mathcal{L}_{reg}$.
The regularization loss $\mathcal{L}_{reg} = \sum_{\mathbf{r}} \sum_{i=1}^{n} w_i (d_i' - d_m)^2$ minimizes the variance of adjusted depths along rays, reducing depth uncertainty.
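A sketch of this mapping objective, assuming the renderer also exposes the per-ray adjusted depths $d_i'$, blending weights $w_i$, and median depth $d_m$; the $\lambda$ values are placeholders.

```python
import torch

def mapping_loss(rgb, depth, obs_rgb, obs_depth,
                 adj_depths, weights, median_depth,
                 lam1=0.5, lam2=0.1):
    """Mapping objective with the depth-variance regularizer (illustrative).

    adj_depths, weights : (num_rays, num_gaussians) tensors of d_i' and w_i
    median_depth        : (num_rays,) tensor of d_m per ray
    """
    l1 = lambda a, b: (a - b).abs().mean()
    # L_reg: weighted spread of adjusted depths around the median depth per ray.
    reg = (weights * (adj_depths - median_depth.unsqueeze(-1)) ** 2).sum(dim=-1).mean()
    return l1(obs_depth, depth) + lam1 * l1(obs_rgb, rgb) + lam2 * reg
```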
5. System Architecture (Front-end/Back-end):
GauS-SLAM utilizes a decoupled front-end and back-end processing pipeline:
- Front-end: Responsible for real-time camera tracking and incremental mapping using a "local map". The first frame of a new local map is the Reference Keyframe (RKF). New frames are tracked against the local map. Frames are selected as Keyframes (KFs) if the newly observed scene exceeds a threshold ($\tau_k = 1\%$). Mapping is performed on KFs. When the local map exceeds a size threshold ($\tau_l = 1.5 \cdot H \cdot W$), it is sent to the back-end, and a new local map is initialized with the current frame as the new RKF (see the control-flow sketch after this list).
- Back-end: Processes local maps asynchronously. It merges the local Gaussian map into the global map (resetting opacities of merged Gaussians). Submaps (frames from local maps) are stored, and co-visibility between submaps is determined (using NetVLAD features). Bundle Adjustment (BA) is performed on the poses of RKFs of co-visible submaps and the global map, minimizing a loss similar to the mapping loss on randomly selected frames. Gaussians with low opacity (<0.05) are pruned after mapping. "Random Optimization" refines the global map using stochastically selected frames from submaps to combat forgetting. "Final Refinement" runs Random Optimization for an extended period after the main loop finishes.
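The front-end control flow described above might look roughly like the following; the `local_map` methods and the queue-based hand-off to the back-end are hypothetical names used only to illustrate the $\tau_k$ and $\tau_l$ thresholds.

```python
def frontend_step(frame, local_map, backend_queue, tau_k=0.01, tau_l_factor=1.5):
    """One front-end iteration: track, possibly add a keyframe, possibly reset."""
    pose = local_map.track(frame)                         # frame-to-model tracking
    new_ratio = local_map.newly_observed_ratio(frame, pose)

    if new_ratio > tau_k:                                 # keyframe selection
        local_map.add_keyframe(frame, pose)
        local_map.map_update()                            # incremental mapping on KFs

    h, w = frame.depth.shape
    if local_map.num_gaussians() > tau_l_factor * h * w:  # local map grew too large
        backend_queue.put(local_map)                      # asynchronous back-end merge
        local_map = local_map.reset(reference_keyframe=frame)  # current frame = new RKF
    return pose, local_map
```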
Implementation Details:
- Hyperparameters for learning rates and iteration counts are adjusted per dataset (e.g., higher iterations for challenging sequences).
- Exposure variation is handled with a simple linear compensation $I' = aI + b$ with learnable coefficients $a, b$ per frame (a brief sketch appears after this list).
- A re-tracking strategy is employed if tracking is lost (e.g., high depth error), resetting the local map and designating the lost frame as a new RKF, potentially leveraging a broader set of submaps.
- Mesh reconstruction for evaluation is done using TSDF Fusion (Zeng et al., 2017).
- Experiments were conducted on an Intel Core i9-14900K CPU with an NVIDIA RTX A6000 GPU.
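For the exposure model noted above, a minimal per-frame linear compensation module (a PyTorch sketch, not the paper's code) could look like:

```python
import torch

class ExposureCompensation(torch.nn.Module):
    """Per-frame linear exposure model I' = a * I + b with learnable a, b."""

    def __init__(self, num_frames):
        super().__init__()
        self.a = torch.nn.Parameter(torch.ones(num_frames))
        self.b = torch.nn.Parameter(torch.zeros(num_frames))

    def forward(self, image, frame_idx):
        # `image` is a rendered color map; coefficients are indexed per frame.
        return self.a[frame_idx] * image + self.b[frame_idx]
```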
Practical Implications and Performance:
GauS-SLAM demonstrates strong performance on standard RGB-D SLAM benchmarks (Replica (Straub et al., 2019), TUM-RGBD (Sturm et al., 2012), ScanNet (Dai et al., 2017), ScanNet++ (Yeshwanth et al., 2023)). It achieves state-of-the-art tracking accuracy on Replica (0.06 cm ATE-RMSE) and ScanNet++ (millimeter-level accuracy), outperforming comparable Gaussian-based methods like SplaTAM (Keetha et al., 2024) and even some methods with loop closure on certain sequences. The use of 2D Gaussian surfels and the surface-aware depth rendering leads to improved geometry accuracy (lower Depth L1, higher F1-Score) and rendering quality (higher PSNR and SSIM, lower LPIPS) compared to most 3D Gaussian methods.
The local map design is shown to be crucial for maintaining tracking accuracy and efficiency, particularly in scenarios with object-circling camera motion where global maps can contain interference regions. Periodically resetting the local map prevents the degradation of tracking/mapping efficiency as the global map grows. The ablation studies confirm the importance of unbiased depth, depth normalization, and depth adjustment for rendering/tracking performance, and the contributions of keyframes, local mapping, and backend optimizations to overall system accuracy and efficiency.
Limitations:
The system's performance can degrade on datasets with significant motion blur and exposure variations (like TUM-RGBD and ScanNet), suggesting sensitivity to these factors that cause multi-view inconsistency.
Future Work:
The authors plan to enhance robustness to challenging conditions like motion blur and exposure variations.