Dy3DGS-SLAM: Monocular 3D Gaussian Splatting SLAM

Updated 30 June 2025
  • Dy3DGS-SLAM is a monocular SLAM system that combines 3D Gaussian Splatting with Bayesian dynamic mask fusion to enable robust mapping and precise trajectory estimation in dynamic scenes.
  • It employs a dynamic mask fusion strategy combining optical flow and depth cues to filter out moving objects, ensuring artifact-free static reconstructions.
  • The system achieves state-of-the-art performance with 17 FPS tracking and over 60% improvement in pose accuracy relative to a flow-only baseline on real-world datasets, using only RGB input.

Dy3DGS-SLAM is a monocular 3D Gaussian Splatting-based Simultaneous Localization and Mapping (SLAM) system explicitly designed for dynamic environments. It is distinguished by three primary innovations: the ability to operate with pure monocular RGB input (no depth sensor), robust dynamic object filtering using a probabilistic fusion of motion and depth cues, and a principled loss design that enables precise tracking and artifact-free dense mapping in real-world scenes characterized by unstructured, unpredictable movement.

1. System Overview and Problem Motivation

Dy3DGS-SLAM directly addresses the limitations of existing SLAM methods rooted in neural radiance fields (NeRF) and 3D Gaussian Splatting, which traditionally exhibit high-fidelity reconstruction in static scenes but fail when confronted with moving objects. Many prior systems rely on RGB-D sensors or predefined semantic priors, restricting their applicability in natural settings with only monocular cameras. Dy3DGS-SLAM extends the SLAM paradigm by introducing a dynamic-aware architecture that achieves state-of-the-art trajectory estimation and mapping quality in dynamic scenes using only monocular RGB input.

Key components include:

  • Monocular RGB Input: Operates using a single camera stream, independent of external depth or semantic detectors.
  • 3D Gaussian Splatting (3DGS): The scene is represented as a collection of anisotropic Gaussians in 3D space, enabling efficient and explicit mapping and rendering.
  • Dynamic Mask Fusion: Dynamic regions are identified via Bayesian fusion of optical flow and learned monocular depth masks, providing accurate dynamic/static segmentation without semantic priors.
  • Principled Loss Functions: Custom motion, photometric, and geometric losses are crafted to mitigate the influence of transient dynamics and enforce scale consistency.

2. Dynamic Mask Fusion: Probabilistic Identification of Motion

A core innovation in Dy3DGS-SLAM is its dynamic mask fusion strategy, which integrates optical flow and monocular depth cues via probabilistic modeling to distinguish dynamic from static image regions.

  • Optical Flow Estimation: A lightweight U-Net is applied to successive RGB frames to generate an optical flow mask $F_m$, highlighting moving regions.
  • Monocular Depth Estimation: An advanced monocular depth network (e.g., DepthAnythingV2) predicts depth, which is processed to form a depth mask $D_m$.
  • Clustering: Pixels identified as dynamic by high optical flow are grouped into clusters (representing moving objects) using k-means clustering:

$$\min_{\mu_1, \ldots, \mu_k} \sum_{i=1}^{k} \sum_{p \in C_i} \|p - \mu_i\|^2$$

where $C_i$ are the pixel clusters and $\mu_i$ their centers.

  • Bayesian Fusion: The final dynamic mask $\hat{M}$ for each cluster is computed via the posterior probability:

$$P(M(p) = 1 \mid D_m, F_m) = P(D_m \mid M(p)) \cdot P(F_m \mid M(p))$$

Pixels for which $P(M(p) = 1 \mid D_m, F_m) > T$, with $T = 0.95$, are accepted as dynamic. This operation is repeated for all clusters, yielding a combined dynamic mask that is robust to noise and capable of multi-object identification; a sketch of this fusion step is given below.
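To make the clustering and fusion steps concrete, the following is a minimal sketch, assuming the flow and depth cues have already been converted to per-pixel likelihood maps in $[0, 1]$. The function name, the likelihood interpretation, the cluster count, and the use of scikit-learn's KMeans are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans  # any k-means implementation works here

def fuse_dynamic_mask(flow_prob, depth_prob, flow_thresh=0.5, k=3, T=0.95):
    """Hypothetical sketch of the cluster-wise Bayesian mask fusion.

    flow_prob, depth_prob : (H, W) arrays in [0, 1], interpreted here as
        per-pixel likelihoods P(F_m | M(p)) and P(D_m | M(p)).
    Returns a binary (H, W) dynamic mask.
    """
    H, W = flow_prob.shape
    dynamic = np.zeros((H, W), dtype=bool)

    # Pixels flagged as moving by the optical-flow cue.
    ys, xs = np.nonzero(flow_prob > flow_thresh)
    if len(ys) < k:
        return dynamic  # not enough candidate pixels to form clusters

    # Group candidate pixels into k clusters (one per putative moving object).
    coords = np.stack([ys, xs], axis=1).astype(np.float64)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)

    # Accept an entire cluster as dynamic if its fused posterior exceeds T.
    for c in range(k):
        cy, cx = ys[labels == c], xs[labels == c]
        posterior = np.mean(flow_prob[cy, cx] * depth_prob[cy, cx])
        if posterior > T:
            dynamic[cy, cx] = True
    return dynamic


# Toy usage with random "likelihood" maps:
rng = np.random.default_rng(0)
mask = fuse_dynamic_mask(rng.random((120, 160)), rng.random((120, 160)))
print(mask.sum(), "pixels flagged as dynamic")
```

With real inputs, the flow likelihood would come from the U-Net's motion output and the depth likelihood from the processed DepthAnythingV2 mask described above.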

3. Impact of the Dynamic Mask on Tracking and Geometry

The fused dynamic mask $\hat{M}$ has two principal roles:

  • Pose Estimation Filtering: Tracking is restricted to the pixels that the fused mask $M_{ds}$ labels as static. The tracking process uses a scale-corrected flow map:

$$\tilde{F} = F \cdot M_{ds} \cdot S_n$$

where $S_n$ is a depth-derived scale factor. By excluding dynamic pixels, the system prevents erroneous pose updates caused by moving objects (see the sketch after this list).

  • Geometry Refinement: During mapping, Gaussians associated with dynamic pixels are flagged for special treatment: they are penalized or pruned, preventing the inclusion of transient geometry in the reconstructed static scene.
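As referenced in the first item above, the scale-corrected flow map is simple to express directly. The following is a minimal sketch, assuming the convention that $M_{ds} = 1$ on static pixels and $0$ on dynamic ones so that dynamic flow is zeroed out; the mask convention and the scalar form of $S_n$ are assumptions, not details confirmed by the paper.

```python
import numpy as np

def scale_corrected_flow(flow, dynamic_mask, s_n):
    """Sketch of the scale-corrected flow F~ = F * M_ds * S_n (Section 3).

    flow         : (H, W, 2) optical flow between consecutive frames.
    dynamic_mask : (H, W) boolean, True where a pixel was flagged dynamic.
    s_n          : scalar scale factor derived from monocular depth.
    Returns the masked, scale-corrected flow used for pose estimation.
    """
    # Assumed convention: M_ds = 1 on static pixels, 0 on dynamic pixels.
    m_ds = (~dynamic_mask).astype(flow.dtype)
    return flow * m_ds[..., None] * s_n  # zero out dynamic pixels, rescale the rest


# Toy usage:
H, W = 120, 160
flow = np.random.randn(H, W, 2).astype(np.float32)
dyn = np.zeros((H, W), dtype=bool)
dyn[40:80, 60:100] = True                          # pretend a moving object occupies this box
corrected = scale_corrected_flow(flow, dyn, s_n=0.8)
assert np.all(corrected[40:80, 60:100] == 0.0)     # dynamic region contributes nothing to tracking
```

In the full system, this masked and rescaled flow is what drives the pose update, so moving objects contribute no residuals to tracking.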

4. Loss Functions: Motion, Color, and Depth

Dy3DGS-SLAM introduces novel loss formulations to support dynamic environments:

  • Motion Loss for Tracking: A motion loss incorporating scale and mask constraints:

$$\mathcal{L}_M = \frac{\hat{T}}{\max(|\hat{T} \cdot S_n|, \varepsilon)} - \frac{T}{\max(|T \cdot S_n|, \varepsilon)} + (\hat{R} - R) \cdot M_{ds}$$

where $\hat{T}, T$ are the predicted and ground-truth translations, $\hat{R}, R$ the corresponding rotations, and $S_n$ the scale factor.

  • Multitask Tracking Loss:

$$\mathcal{L}_P = \lambda_1 \mathcal{L}_O + \lambda_2 \mathcal{L}_U + \mathcal{L}_M$$

with $\mathcal{L}_O$ the optical flow loss and $\mathcal{L}_U$ the motion segmentation loss.

  • Photometric and Depth Rendering Losses: For mapping, dynamic pixels are isolated and penalized:

$$L_c = \lambda_d \cdot \frac{N_d}{N_{pi}} |C_k - C_k^{\mathrm{gt}}| + \lambda_s \cdot \frac{N_{pi} - N_d}{N_{pi}} |C_k - C_k^{\mathrm{gt}}|$$

$$L_d = \lambda_t \cdot \frac{D_d}{D_{pi}} |D_k - D_k^{e}| + \lambda_m \cdot \frac{D_{pi} - D_d}{D_{pi}} |D_k - D_k^{e}|$$

Rendered color and depth for static regions are more strongly matched to observations, while dynamic regions are suppressed (e.g., their depth is set to infinity), ensuring the static map remains uncontaminated.
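The sketch below is a minimal NumPy rendition of these two rendering losses under explicit assumptions: $N_d$ and $N_{pi}$ are taken to be the dynamic and total pixel counts, $D_d$ and $D_{pi}$ their depth-domain counterparts, $C_k, C_k^{\mathrm{gt}}$ the rendered and observed colors, and $D_k, D_k^{e}$ the rendered and estimated depths. The per-pixel aggregation and the weights are illustrative, not taken from the paper.

```python
import numpy as np

def masked_rendering_losses(C_r, C_gt, D_r, D_e, dynamic_mask,
                            lam_d=1.0, lam_s=1.0, lam_t=1.0, lam_m=1.0):
    """Sketch of the photometric (L_c) and depth (L_d) rendering losses (Section 4).

    C_r, C_gt    : (H, W, 3) rendered and observed color images.
    D_r, D_e     : (H, W) rendered and estimated (monocular) depth maps.
    dynamic_mask : (H, W) boolean, True on pixels flagged dynamic.

    Assumption: N_d / N_pi (and D_d / D_pi) are interpreted as dynamic vs. total
    pixel counts, so the two terms reweight the dynamic and static residuals.
    """
    n_pi = dynamic_mask.size
    n_d = int(dynamic_mask.sum())

    color_err = np.abs(C_r - C_gt).mean(axis=-1)    # per-pixel |C_k - C_k^gt|
    depth_err = np.abs(D_r - D_e)                   # per-pixel |D_k - D_k^e|

    dyn, stat = dynamic_mask, ~dynamic_mask

    # L_c: dynamic-weighted term + static-weighted term on the color residuals.
    L_c = (lam_d * (n_d / n_pi) * color_err[dyn].sum()
           + lam_s * ((n_pi - n_d) / n_pi) * color_err[stat].sum())

    # L_d: same structure on the depth residuals.
    L_d = (lam_t * (n_d / n_pi) * depth_err[dyn].sum()
           + lam_m * ((n_pi - n_d) / n_pi) * depth_err[stat].sum())
    return L_c, L_d


# Toy usage:
rng = np.random.default_rng(0)
H, W = 60, 80
mask = rng.random((H, W)) > 0.9                     # roughly 10% of pixels flagged dynamic
L_c, L_d = masked_rendering_losses(rng.random((H, W, 3)), rng.random((H, W, 3)),
                                   rng.random((H, W)), rng.random((H, W)), mask)
print(f"L_c = {L_c:.3f}, L_d = {L_d:.3f}")
```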

5. Experimental Results and System Properties

  • Datasets: Dy3DGS-SLAM is evaluated on real-world challenging datasets such as TUM RGB-D, AirDOS-Shibuya, and BONN RGB-D, each characterized by significant dynamic content.
  • Accuracy: On the BONN dataset, Dy3DGS-SLAM achieves the lowest absolute trajectory error (ATE RMSE) among all compared systems (average 4.5 cm), despite using only monocular RGB.
  • Efficiency: The pipeline operates at practical speeds (17 FPS tracking), using a single network iteration for pose estimation—substantially improving over baselines requiring multiple iterations.
  • Mapping Quality: Visual results show high-fidelity, artifact-free reconstructions, with dynamic objects (e.g., moving people, hands) absent from the static map. Mask fusion improves tracking ATE by over 60% relative to optical flow only.
  • Robustness: The approach generalizes to multiple, unpredictable moving objects and operates without semantic or depth priors.

6. Distinctive Features in the Landscape of Dynamic SLAM

Compared to prior methods:

| Method | Input | Dynamic Handling | Dynamic Modeling | Static Map Quality |
|---|---|---|---|---|
| DynaSLAM | RGB-D | Semantic segmentation | No | High (static only) |
| Gassidy | RGB-D | Instance/loss flow | Suppress dynamics | High (static only) |
| DynaGSLAM | RGB-D | Flow-based, GS-based | Dynamic GS objects | High |
| Dy3DGS-SLAM | RGB (mono) | Probabilistic fusion | Suppress dynamics | High (RGB only) |

Dy3DGS-SLAM’s main contribution is dynamic segmentation and suppression with only monocular RGB and no pre-labeled priors, while maintaining superior tracking and mapping performance even compared to RGB-D methods.

7. Challenges, Limitations, and Future Directions

  • Dependence on Mask Accuracy: Residual dynamic pixels due to imperfect depth or flow estimation may occasionally introduce error, though the Bayesian fusion helps mitigate this.
  • Extension to Mobile/Edge: The authors highlight plans to adapt Dy3DGS-SLAM for low-power devices and mobile platforms, improving efficiency further.
  • Generalization Beyond Indoor: While tested on real-world indoor and semi-outdoor datasets, adaptation to large, highly dynamic urban scenes remains an avenue for future work.

Conclusion

Dy3DGS-SLAM establishes a new standard for dense SLAM in dynamic environments, demonstrating that robust, high-quality tracking and mapping can be achieved with only monocular RGB input, advanced dynamic mask fusion, and loss-aware optimization. Its contribution lies in the unification of probabilistic motion-depth sensor fusion, efficient dynamic suppression, and high-fidelity scene reconstruction, making it particularly suitable for robotics, augmented reality, and real-time scene understanding in unconstrained and unpredictable environments.