Dy3DGS-SLAM: Monocular 3D Gaussian Splatting SLAM

Updated 30 June 2025
  • Dy3DGS-SLAM is a monocular SLAM system that combines 3D Gaussian Splatting with Bayesian dynamic mask fusion to enable robust mapping and precise trajectory estimation in dynamic scenes.
  • It employs a dynamic mask fusion strategy combining optical flow and depth cues to filter out moving objects, ensuring artifact-free static reconstructions.
  • The system achieves state-of-the-art performance with 17 FPS tracking and over 60% improvement in pose accuracy relative to a flow-only baseline on real-world datasets, using only RGB input.

Dy3DGS-SLAM is a monocular 3D Gaussian Splatting-based Simultaneous Localization and Mapping (SLAM) system explicitly designed for dynamic environments. It is distinguished by three primary innovations: the ability to operate with pure monocular RGB input (no depth sensor), robust dynamic object filtering using a probabilistic fusion of motion and depth cues, and a principled loss design that enables precise tracking and artifact-free dense mapping in real-world scenes characterized by unstructured, unpredictable movement.

1. System Overview and Problem Motivation

Dy3DGS-SLAM directly addresses the limitations of existing SLAM methods rooted in neural radiance fields (NeRF) and 3D Gaussian Splatting, which traditionally exhibit high-fidelity reconstruction in static scenes but fail when confronted with moving objects. Many prior systems rely on RGB-D sensors or predefined semantic priors, restricting their applicability in natural settings with only monocular cameras. Dy3DGS-SLAM extends the SLAM paradigm by introducing a dynamic-aware architecture that achieves state-of-the-art trajectory estimation and mapping quality in dynamic scenes using only monocular RGB input.

Key components include:

  • Monocular RGB Input: Operates using a single camera stream, independent of external depth or semantic detectors.
  • 3D Gaussian Splatting (3DGS): The scene is represented as a collection of anisotropic Gaussians in 3D space, enabling efficient and explicit mapping and rendering.
  • Dynamic Mask Fusion: Dynamic regions are identified via Bayesian fusion of optical flow and learned monocular depth masks, providing accurate dynamic/static segmentation without semantic priors.
  • Principled Loss Functions: Custom motion, photometric, and geometric losses are crafted to mitigate the influence of transient dynamics and enforce scale consistency.

2. Dynamic Mask Fusion: Probabilistic Identification of Motion

A core innovation in Dy3DGS-SLAM is its dynamic mask fusion strategy, which integrates optical flow and monocular depth cues via probabilistic modeling to distinguish dynamic from static image regions.

  • Optical Flow Estimation: A lightweight U-Net is applied to successive RGB frames to generate an optical flow mask $F_m$, highlighting moving regions.
  • Monocular Depth Estimation: An advanced monocular depth network (e.g., DepthAnythingV2) predicts depth, which is processed to form a depth mask $D_m$.
  • Clustering: Pixels identified as dynamic by high optical flow are grouped into clusters (representing moving objects) using k-means clustering:

$$\min_{\mu_1, \ldots, \mu_k} \sum_{i=1}^{k} \sum_{p \in C_i} \|p - \mu_i\|^2$$

where $C_i$ are the pixel clusters and $\mu_i$ their centers.

  • Bayesian Fusion: The final dynamic mask $\hat{M}$ for each cluster is computed via the posterior probability:

$$P(M(p) = 1 \mid D_m, F_m) = P(D_m \mid M(p)) \cdot P(F_m \mid M(p))$$

Pixels for which $P(M(p) = 1 \mid D_m, F_m) > T$, with $T = 0.95$, are accepted as dynamic. This operation is repeated for all clusters, yielding a combined dynamic mask that is robust to noise and capable of multi-object identification; a sketch of this fusion step is given below.
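To make the clustering and fusion steps concrete, the following is a minimal sketch, assuming the flow and depth cues have already been converted to per-pixel likelihood maps in $[0, 1]$. The function name, the likelihood interpretation, the cluster count, and the use of scikit-learn's KMeans are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans  # any k-means implementation works here

def fuse_dynamic_mask(flow_prob, depth_prob, flow_thresh=0.5, k=3, T=0.95):
    """Hypothetical sketch of the cluster-wise Bayesian mask fusion.

    flow_prob, depth_prob : (H, W) arrays in [0, 1], interpreted here as
        per-pixel likelihoods P(F_m | M(p)) and P(D_m | M(p)).
    Returns a binary (H, W) dynamic mask.
    """
    H, W = flow_prob.shape
    dynamic = np.zeros((H, W), dtype=bool)

    # Pixels flagged as moving by the optical-flow cue.
    ys, xs = np.nonzero(flow_prob > flow_thresh)
    if len(ys) < k:
        return dynamic  # not enough candidate pixels to form clusters

    # Group candidate pixels into k clusters (one per putative moving object).
    coords = np.stack([ys, xs], axis=1).astype(np.float64)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)

    # Accept an entire cluster as dynamic if its fused posterior exceeds T.
    for c in range(k):
        cy, cx = ys[labels == c], xs[labels == c]
        posterior = np.mean(flow_prob[cy, cx] * depth_prob[cy, cx])
        if posterior > T:
            dynamic[cy, cx] = True
    return dynamic


# Toy usage with random "likelihood" maps:
rng = np.random.default_rng(0)
mask = fuse_dynamic_mask(rng.random((120, 160)), rng.random((120, 160)))
print(mask.sum(), "pixels flagged as dynamic")
```

With real inputs, the flow likelihood would come from the U-Net's motion output and the depth likelihood from the processed DepthAnythingV2 mask described above.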

3. Impact of the Dynamic Mask on Tracking and Geometry

The fused dynamic mask $\hat{M}$ has two principal roles:

  • Pose Estimation Filtering: Tracking is restricted to the pixels that the fused mask $M_{ds}$ labels as static. The tracking process uses a scale-corrected flow map:

$$\tilde{F} = F \cdot M_{ds} \cdot S_n$$

where $S_n$ is a depth-derived scale factor. By excluding dynamic pixels, the system prevents erroneous pose updates caused by moving objects (see the sketch after this list).

  • Geometry Refinement: During mapping, Gaussians associated with dynamic pixels are flagged for special treatment: they are penalized or pruned, preventing the inclusion of transient geometry in the reconstructed static scene.
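As referenced in the first item above, the scale-corrected flow map is simple to express directly. The following is a minimal sketch, assuming the convention that $M_{ds} = 1$ on static pixels and $0$ on dynamic ones so that dynamic flow is zeroed out; the mask convention and the scalar form of $S_n$ are assumptions, not details confirmed by the paper.

```python
import numpy as np

def scale_corrected_flow(flow, dynamic_mask, s_n):
    """Sketch of the scale-corrected flow F~ = F * M_ds * S_n (Section 3).

    flow         : (H, W, 2) optical flow between consecutive frames.
    dynamic_mask : (H, W) boolean, True where a pixel was flagged dynamic.
    s_n          : scalar scale factor derived from monocular depth.
    Returns the masked, scale-corrected flow used for pose estimation.
    """
    # Assumed convention: M_ds = 1 on static pixels, 0 on dynamic pixels.
    m_ds = (~dynamic_mask).astype(flow.dtype)
    return flow * m_ds[..., None] * s_n  # zero out dynamic pixels, rescale the rest


# Toy usage:
H, W = 120, 160
flow = np.random.randn(H, W, 2).astype(np.float32)
dyn = np.zeros((H, W), dtype=bool)
dyn[40:80, 60:100] = True                          # pretend a moving object occupies this box
corrected = scale_corrected_flow(flow, dyn, s_n=0.8)
assert np.all(corrected[40:80, 60:100] == 0.0)     # dynamic region contributes nothing to tracking
```

In the full system, this masked and rescaled flow is what drives the pose update, so moving objects contribute no residuals to tracking.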

4. Loss Functions: Motion, Color, and Depth

Dy3DGS-SLAM introduces novel loss formulations to support dynamic environments:

  • Motion Loss for Tracking: A motion loss incorporating scale and mask constraints:

$$\mathcal{L}_M = \frac{\hat{T}}{\max(|\hat{T} \cdot S_n|, \varepsilon)} - \frac{T}{\max(|T \cdot S_n|, \varepsilon)} + (\hat{R} - R) \cdot M_{ds}$$

where $\hat{T}, T$ are the predicted and ground-truth translations, $\hat{R}, R$ the corresponding rotations, and $S_n$ the scale factor.

  • Multitask Tracking Loss:

$$\mathcal{L}_P = \lambda_1 \mathcal{L}_O + \lambda_2 \mathcal{L}_U + \mathcal{L}_M$$

with $\mathcal{L}_O$ the optical flow loss and $\mathcal{L}_U$ the motion segmentation loss.

  • Photometric and Depth Rendering Losses: For mapping, dynamic pixels are isolated and penalized:

$$L_c = \lambda_d \cdot \frac{N_d}{N_{pi}} |C_k - C_k^{\mathrm{gt}}| + \lambda_s \cdot \frac{N_{pi} - N_d}{N_{pi}} |C_k - C_k^{\mathrm{gt}}|$$

$$L_d = \lambda_t \cdot \frac{D_d}{D_{pi}} |D_k - D_k^{e}| + \lambda_m \cdot \frac{D_{pi} - D_d}{D_{pi}} |D_k - D_k^{e}|$$

Rendered color and depth for static regions are more strongly matched to observations, while dynamic regions are suppressed (e.g., their depth is set to infinity), ensuring the static map remains uncontaminated.
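The sketch below is a minimal NumPy rendition of these two rendering losses under explicit assumptions: $N_d$ and $N_{pi}$ are taken to be the dynamic and total pixel counts, $D_d$ and $D_{pi}$ their depth-domain counterparts, $C_k, C_k^{\mathrm{gt}}$ the rendered and observed colors, and $D_k, D_k^{e}$ the rendered and estimated depths. The per-pixel aggregation and the weights are illustrative, not taken from the paper.

```python
import numpy as np

def masked_rendering_losses(C_r, C_gt, D_r, D_e, dynamic_mask,
                            lam_d=1.0, lam_s=1.0, lam_t=1.0, lam_m=1.0):
    """Sketch of the photometric (L_c) and depth (L_d) rendering losses (Section 4).

    C_r, C_gt    : (H, W, 3) rendered and observed color images.
    D_r, D_e     : (H, W) rendered and estimated (monocular) depth maps.
    dynamic_mask : (H, W) boolean, True on pixels flagged dynamic.

    Assumption: N_d / N_pi (and D_d / D_pi) are interpreted as dynamic vs. total
    pixel counts, so the two terms reweight the dynamic and static residuals.
    """
    n_pi = dynamic_mask.size
    n_d = int(dynamic_mask.sum())

    color_err = np.abs(C_r - C_gt).mean(axis=-1)    # per-pixel |C_k - C_k^gt|
    depth_err = np.abs(D_r - D_e)                   # per-pixel |D_k - D_k^e|

    dyn, stat = dynamic_mask, ~dynamic_mask

    # L_c: dynamic-weighted term + static-weighted term on the color residuals.
    L_c = (lam_d * (n_d / n_pi) * color_err[dyn].sum()
           + lam_s * ((n_pi - n_d) / n_pi) * color_err[stat].sum())

    # L_d: same structure on the depth residuals.
    L_d = (lam_t * (n_d / n_pi) * depth_err[dyn].sum()
           + lam_m * ((n_pi - n_d) / n_pi) * depth_err[stat].sum())
    return L_c, L_d


# Toy usage:
rng = np.random.default_rng(0)
H, W = 60, 80
mask = rng.random((H, W)) > 0.9                     # roughly 10% of pixels flagged dynamic
L_c, L_d = masked_rendering_losses(rng.random((H, W, 3)), rng.random((H, W, 3)),
                                   rng.random((H, W)), rng.random((H, W)), mask)
print(f"L_c = {L_c:.3f}, L_d = {L_d:.3f}")
```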

5. Experimental Results and System Properties

  • Datasets: Dy3DGS-SLAM is evaluated on real-world challenging datasets such as TUM RGB-D, AirDOS-Shibuya, and BONN RGB-D, each characterized by significant dynamic content.
  • Accuracy: On the BONN dataset, Dy3DGS-SLAM achieves the lowest absolute trajectory error (ATE RMSE) among all compared systems (average 4.5 cm), despite using only monocular RGB.
  • Efficiency: The pipeline operates at practical speeds (17 FPS tracking), using a single network iteration for pose estimation—substantially improving over baselines requiring multiple iterations.
  • Mapping Quality: Visual results show high-fidelity, artifact-free reconstructions, with dynamic objects (e.g., moving people, hands) absent from the static map. Mask fusion improves tracking ATE by over 60% relative to optical flow only.
  • Robustness: The approach generalizes to multiple, unpredictable moving objects and operates without semantic or depth priors.

6. Distinctive Features in the Landscape of Dynamic SLAM

Compared to prior methods:

| Method | Input | Dynamic Handling | Dynamic Modeling | Static Map Quality |
|---|---|---|---|---|
| DynaSLAM | RGB-D | Semantic segmentation | No | High (static only) |
| Gassidy | RGB-D | Instance/loss flow | Suppress dynamics | High (static only) |
| DynaGSLAM | RGB-D | Flow-based, GS-based | Dynamic GS objects | High |
| Dy3DGS-SLAM | RGB (mono) | Probabilistic fusion | Suppress dynamics | High (RGB only) |

Dy3DGS-SLAM’s main contribution is dynamic segmentation and suppression with only monocular RGB and no pre-labeled priors, while maintaining superior tracking and mapping performance even compared to RGB-D methods.

7. Challenges, Limitations, and Future Directions

  • Dependence on Mask Accuracy: Residual dynamic pixels due to imperfect depth or flow estimation may occasionally introduce error, though the Bayesian fusion helps mitigate this.
  • Extension to Mobile/Edge: The authors highlight plans to adapt Dy3DGS-SLAM for low-power devices and mobile platforms, improving efficiency further.
  • Generalization Beyond Indoor: While tested on real-world indoor and semi-outdoor datasets, adaptation to large, highly dynamic urban scenes remains an avenue for future work.

Conclusion

Dy3DGS-SLAM establishes a new standard for dense SLAM in dynamic environments, demonstrating that robust, high-quality tracking and mapping can be achieved with only monocular RGB input, advanced dynamic mask fusion, and loss-aware optimization. Its contribution lies in the unification of probabilistic motion-depth sensor fusion, efficient dynamic suppression, and high-fidelity scene reconstruction, making it particularly suitable for robotics, augmented reality, and real-time scene understanding in unconstrained and unpredictable environments.