Dynamic Visual SLAM with 3D Gaussian Prior
- These systems introduce a 3D Gaussian Splatting prior that explicitly models both static and dynamic scene elements to enhance SLAM robustness.
- They integrate motion segmentation and hybrid pose optimization, effectively combining photometric, geometric, and temporal cues.
- Experiments demonstrate reduced ATE and improved rendering quality, highlighting adaptive Gaussian management and joint optimization.
Dynamic Visual SLAM using a General 3D Prior refers to a class of algorithms and systems for Simultaneous Localization and Mapping (SLAM) that leverage an explicit, continuous, and differentiable 3D Gaussian representation as a scene prior to achieve robust camera tracking, mapping, and consistent rendering in environments containing dynamic (i.e., moving) objects. Unlike traditional point-cloud- or voxel-based SLAM systems, which typically fail or degrade when the static-scene assumption is violated, the use of 3D Gaussian Splatting (3DGS) enables the integration of geometric, photometric, and temporal reasoning with explicit modeling of both static and dynamic entities. Important contributions in this area include DG-SLAM (Xu et al., 13 Nov 2024), Dy3DGS-SLAM (Li et al., 6 Jun 2025), and DynaGSLAM (Li et al., 15 Mar 2025), each introducing technical mechanisms for segmentation, representation, and joint optimization in dynamic scenes.
1. 3D Gaussian Splatting as a General Prior
The 3D Gaussian Splatting prior represents a scene as a collection of explicit 3D Gaussian ellipsoids, each parametrized by its center $\mu_i$, a covariance matrix $\Sigma_i$ encoding its spatial extent, an opacity or density parameter $\alpha_i$, and appearance coefficients (either RGB or higher-order spherical harmonics for view-dependent color modeling). The contribution of the $i$-th Gaussian to the scene density at a spatial location $\mathbf{x}$ is given by:

$$G_i(\mathbf{x}) = \exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\mu_i)\right)$$
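As a concrete illustration, the following minimal NumPy sketch evaluates the unnormalized density of a single anisotropic 3D Gaussian at a query point; the function and variable names (`gaussian_density`, `mu`, `cov`) are illustrative and not drawn from any of the cited systems.

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Unnormalized density of one anisotropic 3D Gaussian at point x.

    x, mu : (3,) arrays; cov : (3, 3) positive-definite covariance matrix.
    """
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

# Example: an ellipsoid elongated along the z-axis, queried slightly off-center.
mu = np.array([0.0, 0.0, 1.0])
cov = np.diag([0.01, 0.01, 0.09])        # anisotropic spatial extent
print(gaussian_density(np.array([0.05, 0.0, 1.0]), mu, cov))
```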
This prior exhibits key advantages:
- Continuity and Differentiability: The smooth probabilistic field over 3D space is amenable to gradient-based optimization for both pose estimation and mapping.
- Expressiveness: View-dependent appearance is encoded via spherical harmonics; anisotropic geometric structure is captured by full covariances.
- Explicitness: Unlike Neural Radiance Fields (NeRF), which are implicit and slow to render, 3DGS supports real-time rasterization and efficient updates.
- Extensibility: The prior supports both static and dynamic components with independent parameter evolution (Xu et al., 13 Nov 2024, Li et al., 6 Jun 2025, Li et al., 15 Mar 2025).
2. Dynamic Object Handling and Segmentation
Handling dynamics requires extension of the prior and the SLAM estimation process:
- Motion Mask Generation (DG-SLAM, Dy3DGS-SLAM): Motion masks are derived from a combination of optical flow, depth, and semantic segmentation. In DG-SLAM, a depth warp and semantic mask are fused across a sliding window, producing a pixel-wise dynamic/static classification that is used to downweight or exclude dynamic pixels during pose and map optimization (Xu et al., 13 Nov 2024). Dy3DGS-SLAM fuses binary masks from optical flow and monocular depth in a Bayesian manner; mask thresholding and clustering then yield the final dynamic mask (Li et al., 6 Jun 2025). An illustrative fusion sketch follows this list.
- Dynamic/Static 3D Splat Management (DynaGSLAM): Dynamic and static Gaussians are managed in separate arrays. Dynamic object regions are detected via optical flow (e.g., RAFT), refined with segmentation (e.g., SAM2), lifted to 3D, and tracked over time. Separate learning rates and deletion policies are maintained. Cubic Hermite splines are used to interpolate dynamic Gaussian trajectories in continuous time (Li et al., 15 Mar 2025).
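To make the mask-fusion step concrete, the sketch below combines per-pixel motion probabilities from flow and depth cues with a naive Bayes-style product, then thresholds the result and removes small connected components. It is a simplified stand-in for the Bayesian fusion used by Dy3DGS-SLAM, with illustrative thresholds and helper names rather than the published formulation.

```python
import numpy as np
from scipy import ndimage  # connected-component clustering

def fuse_dynamic_masks(p_flow, p_depth, thresh=0.5, min_region=50):
    """Fuse per-pixel motion probabilities from optical-flow and depth cues.

    p_flow, p_depth : (H, W) arrays in [0, 1], probability that a pixel is dynamic.
    Returns a boolean dynamic mask after thresholding and small-region removal.
    """
    # Naive Bayes-style fusion assuming conditional independence of the two cues.
    p_dyn = p_flow * p_depth
    p_stat = (1.0 - p_flow) * (1.0 - p_depth)
    posterior = p_dyn / np.clip(p_dyn + p_stat, 1e-8, None)

    mask = posterior > thresh
    # Cluster into connected components and drop tiny spurious regions.
    labels, n = ndimage.label(mask)
    for k in range(1, n + 1):
        if (labels == k).sum() < min_region:
            mask[labels == k] = False
    return mask
```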
3. Adaptive Gaussian Management and Optimization
Accurate representation and computational tractability require dynamic adjustment of the Gaussian set:
- Gaussian Addition: New Gaussians are added via two-stage sampling: uniform sampling in newly observed regions, and error-driven addition where opacity is low or depth errors exceed a threshold (see the illustrative sketch after this list).
- Gaussian Pruning: Gaussians are pruned if their accumulated opacity falls below a threshold, their spatial extent becomes degenerate or excessive, or their observation counts are low within a sliding temporal window.
- Implicit Split/Merge: Regions of high error or rich texture spawn additional Gaussians; low-information or over-parameterized areas are pruned, promoting a compact and adaptive representation (Xu et al., 13 Nov 2024).
- Dynamic Gaussian Flow Management: DynaGSLAM updates dynamic objects by matching new 3D motion-segmented points to existing dynamic Gaussians within a distance threshold, reusing or initializing as required and deleting unobserved elements (Li et al., 15 Mar 2025).
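The following sketch illustrates how such addition and pruning criteria might be expressed in practice; the thresholds, array names, and functions are assumptions made for exposition, not values or interfaces from DG-SLAM or DynaGSLAM.

```python
import numpy as np

# Illustrative thresholds; the cited systems use their own tuned values.
OPACITY_MIN   = 0.05    # prune nearly transparent Gaussians
DEPTH_ERR_MAX = 0.10    # metres: spawn Gaussians where rendered depth is this wrong
MIN_OBS       = 3       # prune Gaussians rarely observed within the sliding window

def prune_mask(opacity, obs_count):
    """Boolean mask of Gaussians to delete (low opacity or too few observations)."""
    return (opacity < OPACITY_MIN) | (obs_count < MIN_OBS)

def addition_mask(rendered_depth, sensor_depth, rendered_alpha):
    """Pixels where new Gaussians should be spawned: large depth error or low coverage."""
    depth_err = np.abs(rendered_depth - sensor_depth)
    return (depth_err > DEPTH_ERR_MAX) | (rendered_alpha < 0.5)
```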
4. Pose Estimation and Tracking in Dynamic Scenes
Hybrid and robust pose estimation is central to dynamic SLAM with a 3DGS prior:
- Hybrid Pose Optimization (DG-SLAM): Coarse tracking is conducted via visual odometry (e.g., DROID-SLAM), leveraging flow-corrected correspondences and masking out dynamic pixels in the loss function, followed by fine tracking with differentiable splatting-based objectives. The joint photometric and geometric residuals are minimized with Gauss–Newton updates on SE(3) (Xu et al., 13 Nov 2024).
- Motion-Constrained Losses (Dy3DGS-SLAM): Once the dynamic mask is obtained, static flows are depth-normalized and pose outputs are supervised with a "motion loss" that penalizes translation and rotation mismatches only on static regions, enabling accurate pose prediction in a single iteration (Li et al., 6 Jun 2025).
- Continuous-Time Motion Estimation (DynaGSLAM): The per-splat dynamic motion is modeled with cubic Hermite splines, enabling interpolation and extrapolation of object positions that are not limited to discrete frames; a minimal interpolation sketch follows this list. Ego-motion estimation is handled by a factor-graph–based optimizer robust to dynamic outliers, while dynamic splat parameters are co-optimized with the static background (Li et al., 15 Mar 2025).
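The continuous-time motion model can be illustrated with standard cubic Hermite interpolation of a dynamic Gaussian's center between two keyframes. The sketch below uses generic timestamps and velocities and is not the exact spline parametrization of DynaGSLAM.

```python
import numpy as np

def hermite_interp(t, t0, t1, p0, p1, m0, m1):
    """Cubic Hermite interpolation of a 3D position between two keyframes.

    t0, t1 : keyframe timestamps; p0, p1 : (3,) positions; m0, m1 : (3,) velocities.
    Returns the interpolated position at time t.
    """
    s = (t - t0) / (t1 - t0)                      # normalized time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    dt = t1 - t0
    return h00 * p0 + h10 * dt * m0 + h01 * p1 + h11 * dt * m1

# Example: a dynamic Gaussian moving roughly along +x between two keyframes.
p = hermite_interp(0.5, 0.0, 1.0,
                   np.array([0.0, 0.0, 2.0]), np.array([1.0, 0.0, 2.0]),
                   np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
print(p)   # ≈ [0.5, 0.0, 2.0]
```

Evaluating the same basis functions slightly beyond the last keyframe time yields the short-horizon extrapolation of object positions mentioned above.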
5. Mapping, Scene Synthesis, and Rendering
SLAM pipelines with a 3DGS prior support both mapping and high-fidelity rendering:
- Mapping Losses: Photometric and depth-consistency losses are computed over rendered keyframes, with dynamic pixels incurring reduced penalties to prevent transients from contaminating the static reconstruction. Gaussians heavily concentrated on dynamic pixels are pruned from the map (Li et al., 6 Jun 2025).
- Rendering: Gaussians are rendered in real time via front-to-back alpha compositing. Real-time forward splatting is implemented via custom GPU kernels, enabling interactive and artifact-free visualization of reconstructed scenes. Both static backgrounds and moving objects can be rendered, and dynamic Gaussians support spatio-temporal interpolation or extrapolation, enhancing predictive modeling capabilities (Li et al., 15 Mar 2025). A minimal compositing sketch follows this list.
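As a minimal illustration of front-to-back alpha compositing, the sketch below accumulates a single pixel's color from splats sorted by depth along one ray; the GPU rasterizer additionally handles projection, depth sorting, and tile-based parallelism, which are omitted here.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Front-to-back alpha compositing of splats sorted by depth along one ray.

    colors : (N, 3) per-splat colors (nearest first); alphas : (N,) opacities in [0, 1].
    Returns the accumulated pixel color.
    """
    pixel = np.zeros(3)
    transmittance = 1.0                      # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:             # early termination once nearly opaque
            break
    return pixel

# Example: a red splat in front of a blue one.
print(composite_front_to_back(np.array([[1, 0, 0], [0, 0, 1]], float),
                              np.array([0.6, 0.9])))   # -> [0.6, 0.0, 0.36]
```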
6. Experimental Performance and Benchmarks
Recent systems demonstrate state-of-the-art performance in both tracking and mapping:
- DG-SLAM (Xu et al., 13 Nov 2024): On TUM dynamic sequences, average ATE = 2.2 cm, outperforming DROID-VO (3.3 cm). Mapping accuracy on BONN dynamic scenes reaches 8.06 cm with 43.7% completion, outperforming NICE-SLAM, Co-SLAM, and ESLAM. The pipeline runs at ≈650 ms per frame on RTX 3090Ti with real-time rendering above 200 Hz.
- Dy3DGS-SLAM (Li et al., 6 Jun 2025): On BONN dynamic sequences, achieves 4.5 cm ATE RMSE, outperforming all NeRF/3DGS RGB-D baselines and matching/exceeding traditional RGB-D methods (e.g., DynaSLAM: 4.8 cm). Tracking reaches 17 FPS and mapping 430 ms per keyframe. Scene synthesis yields artifact-free static backgrounds with dynamic object removal.
- DynaGSLAM (Li et al., 15 Mar 2025): Measured over multiple dynamic datasets (Bonn, TUM, OMD), DynaGSLAM improves PSNR/SSIM/LPIPS metrics by significant margins over existing static and "anti-dynamic" baselines. ATE is reduced by approximately 30% over static methods. Overall runtime is ∼3 Hz with efficient memory usage (∼2.6 GB).
| System | Dynamic Input | Dynamic Handling | Modalities | ATE/PSNR (Sample) | Runtime |
|---|---|---|---|---|---|
| DG-SLAM | RGB-D (offline/online) | Motion mask + hybrid VO | RGB-D | 2.2 cm (ATE, TUM) | 650 ms/frame |
| Dy3DGS-SLAM | Monocular RGB | Prob. mask + motion loss | RGB | 4.5 cm (ATE, BONN) | 17 FPS track |
| DynaGSLAM | RGB-D | Dynamic GS, splines | RGB-D | PSNR >18 dB, ATE↓ | ∼3 Hz |
7. Strengths, Limitations, and Future Perspectives
Strengths
- Explicit Modeling: 3D Gaussian Splatting provides continuous, differentiable geometry and appearance, explicit support for high-fidelity rendering, and the ability to represent both static and dynamic content (Xu et al., 13 Nov 2024, Li et al., 15 Mar 2025).
- Dynamic Resilience: Motion masks, probabilistic segmentation fusion, and motion-constrained losses enable robust SLAM in highly dynamic environments.
- Joint Optimization: Simultaneous estimation of ego-motion, dynamic object trajectories, and 3D structure supports predictive spatio-temporal modeling, including motion interpolation/extrapolation (Li et al., 15 Mar 2025).
Limitations
- Dependence on Segmentation: Reliability is sensitive to motion and semantic segmentation quality; catastrophic segmentation errors can degrade map quality or trajectory accuracy (Li et al., 15 Mar 2025, Xu et al., 13 Nov 2024).
- Limited Global Consistency: Most methods do not support global loop closure or city-scale optimizations, leading to potential drift in large-scale or repetitive environments (Xu et al., 13 Nov 2024).
- Dynamic Model Simplicity: Current dynamic priors are limited to simple spline trajectories or per-splat velocities; non-rigid and articulated motions are not yet fully addressed (Li et al., 15 Mar 2025).
Future Directions
Proposed directions include integration of loop closure into Gaussian frameworks, self-supervised dynamic object segmentation, scale-up via hierarchical or out-of-core data structures, and learning-based end-to-end dynamic Gaussian management. Incorporating richer motion priors and multi-sensor fusion (IMU, LiDAR) is an additional research target (Xu et al., 13 Nov 2024, Li et al., 15 Mar 2025).
Dynamic Visual SLAM using a General 3D Prior is thus a rapidly advancing field enabling photorealistic, accurate, and temporally robust mapping and localization in real-world, dynamic scenes by unifying explicit volumetric priors, robust segmentation, and continuous-time motion modeling within efficient, differentiable inference pipelines (Xu et al., 13 Nov 2024, Li et al., 6 Jun 2025, Li et al., 15 Mar 2025).