Dynamic Object SLAM
- Dynamic Object SLAM denotes a class of SLAM methods that explicitly model moving objects, integrating semantic segmentation with geometric techniques to improve both camera and object trajectory estimation.
- The approach leverages joint optimization frameworks and multi-modal sensor data to accurately reconstruct 4D scenes and manage dynamic elements within the environment.
- Applications include autonomous driving, mobile robotics, and augmented reality, where real-time dynamic object tracking and robust mapping are critical.
Dynamic object SLAM refers to a class of simultaneous localization and mapping (SLAM) methodologies that explicitly detect, model, track, or leverage moving objects in the environment, rather than rejecting them as outliers under a strict static-world assumption. These systems are designed to achieve robust camera (or vehicle) pose estimation and mapping while maintaining accurate representations of dynamic scene elements. Recent advancements in dynamic object SLAM span a diverse set of sensor modalities, architectural assumptions, data association strategies, and optimization frameworks.
1. Problem Definition and Historical Context
Traditional SLAM algorithms—whether visual, LiDAR, RGB-D, or multi-modal—have generally operated under the static-world assumption, treating moving objects as error sources to be filtered or rejected. This paradigm restricts applicability in domains such as autonomous driving, robotics in populated areas, and augmented reality, where dynamic content is prevalent and often critical to task completion.
Dynamic object SLAM extends the standard framework by modeling the states (pose, shape, motion) of moving objects, enabling simultaneous trajectory estimation for both the sensor platform and dynamic agents. Early works masked out suspected dynamic regions using heuristics or instance segmentation outputs, whereas contemporary strategies employ joint optimization, motion and rigidity constraints, dense volumetric modeling, and explicit motion prediction (Yang et al., 2018, Strecke et al., 2019, Zhang et al., 2020, Qiu et al., 2021).
2. Core Methodologies and System Architectures
Dynamic object SLAM approaches can be categorized along several axes:
2.1 Semantic and Geometric Integration
Many contemporary systems combine deep learning–based instance/semantic segmentation (e.g., with Mask R-CNN, SOLOv2, YolactEdge) with classical geometric techniques (e.g., multi-view triangulation, rigid-body motion segmentation). Semantic modules provide pixel-wise dynamic object masks, while geometric clustering (e.g., HDBSCAN, Euclidean clustering in LiDAR) or motion segmentation (optical/depth flow, planar segmentation) provides redundancy against segmentation imperfections and enables recovery when semantic predictions are ambiguous (Wang et al., 2022, Krishna et al., 2023).
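As a minimal sketch of this semantic-geometric redundancy, the following assumes a hypothetical per-pixel dynamic mask from a segmentation network and a per-pixel 3D back-projection, and uses scikit-learn's DBSCAN as a stand-in for the Euclidean/HDBSCAN clustering used in the cited systems:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def fuse_semantic_geometric(points, semantic_mask, eps=0.3, min_samples=20,
                            overlap_thresh=0.5):
    """Label pixels dynamic when a geometric cluster overlaps a semantic
    dynamic mask. points: (H, W, 3) back-projected 3D points;
    semantic_mask: (H, W) bool. Returns an (H, W) bool dynamic mask."""
    h, w, _ = points.shape
    pts = points.reshape(-1, 3)
    sem = semantic_mask.reshape(-1)

    # Geometric grouping: Euclidean clustering of the back-projected points
    # (DBSCAN stands in for the HDBSCAN/Euclidean clustering cited above).
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)

    dynamic = sem.copy()  # keep semantic detections even if geometry is silent
    for lbl in np.unique(labels):
        if lbl == -1:
            continue  # DBSCAN noise: leave these to the semantic mask alone
        members = labels == lbl
        # Promote a whole cluster when enough of it is semantically dynamic;
        # this recovers object pixels the segmentation network missed.
        if sem[members].mean() > overlap_thresh:
            dynamic[members] = True
    return dynamic.reshape(h, w)
```

In practice the points would be subsampled before clustering; the point of the sketch is that a geometric cluster can promote pixels the segmentation network missed, while semantic detections are retained where geometry is inconclusive.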
2.2 Data Association and Motion Segmentation
Feature correspondence and association of observations over time—crucial for tracking dynamic objects—are handled via:
- Optical flow and scene flow for dense short-term point associations (Zhang et al., 2020, Wadud et al., 2022)
- Multi-model motion segmentation (e.g., labeling feature tracks by residual consistency to parametric ego and object motion models; see the sketch following Table 1) (Wang et al., 2020)
- Sliding window data association using historical trajectories and polynomial fitting (Tian et al., 2022)
- Probabilistic data association using soft assignment likelihoods in an EM framework (Strecke et al., 2019)
Table 1: Representative Data Association Techniques
| Approach | Data Association Method | Modality |
|---|---|---|
| EM-Fusion (Strecke et al., 2019) | EM soft assignment via pixel likelihoods | RGB-D |
| DymSLAM (Wang et al., 2020) | Multi-model geometric residual clustering | Stereo vision |
| DL-SLOT (Tian et al., 2022) | Trajectory prediction + assignment via polynomial fitting | LiDAR |
| VDO-SLAM (Zhang et al., 2020) | Optical flow-based dense feature association | Monocular/RGB-D |
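To make the residual-consistency idea behind multi-model motion segmentation concrete, the following sketch assigns each 3D feature track to whichever rigid-motion hypothesis (ego motion plus one per candidate object) best explains it; the function name, inputs, and threshold are illustrative, not taken from the cited systems:

```python
import numpy as np

def assign_tracks_to_motions(p_prev, p_curr, motions, max_residual=0.05):
    """Label 3D feature tracks by residual consistency to a set of rigid
    SE(3) motion hypotheses. p_prev, p_curr: (N, 3) point positions at
    frames t-1 and t; motions: list of (4, 4) candidate transforms.
    Returns (N,) model labels, with -1 for tracks no model explains."""
    n = p_prev.shape[0]
    homo = np.hstack([p_prev, np.ones((n, 1))])  # homogeneous coordinates
    # (M, N) matrix of transfer residuals, one row per motion hypothesis.
    residuals = np.stack([
        np.linalg.norm((homo @ T.T)[:, :3] - p_curr, axis=1)
        for T in motions
    ])
    labels = residuals.argmin(axis=0)
    # Tracks poorly explained even by the best hypothesis are outliers.
    labels[residuals.min(axis=0) > max_residual] = -1
    return labels
```

In a full system the hypotheses themselves are estimated (e.g., by RANSAC over the tracks), and labeling alternates with re-estimation until the segmentation stabilizes.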
2.3 Object Motion and Representation Models
Dynamic objects are represented using various models (an SE(3) motion-model sketch follows this list):
- Rigid cuboid parameterization with associated motion models (e.g., nonholonomic vehicle constraints in CubeSLAM (Yang et al., 2018)).
- Dense volumetric representations (TSDF/SDF) for object-level reconstructions (Strecke et al., 2019).
- Explicit Gaussian splatting with time-varying means for online rendering and motion prediction of dynamic splats (Li et al., 15 Mar 2025, Li et al., 6 Jun 2025, Liu et al., 31 Aug 2025).
- Articulated object models imposing rigidity and motion constraints among body parts (AirDOS (Qiu et al., 2021)).
- Probabilistic mask fusion from optical flow and monocular depth for dynamic identification with monocular input (Li et al., 6 Jun 2025).
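To ground the rigid-object parameterizations above, here is a minimal sketch of an SE(3) object state with a constant-motion prediction step; the class and attribute names are illustrative assumptions, not an interface from the cited systems:

```python
import numpy as np

class RigidObjectState:
    """A rigid dynamic object: an SE(3) pose T_wo (object-to-world) plus a
    body-frame point set; the last frame-to-frame motion H doubles as a
    constant-motion predictor."""

    def __init__(self, T_wo, points_obj):
        self.T_wo = T_wo              # (4, 4) object pose in the world frame
        self.points_obj = points_obj  # (N, 3) points rigidly attached to it
        self.H = np.eye(4)            # last estimated inter-frame motion

    def points_world(self):
        homo = np.hstack([self.points_obj,
                          np.ones((len(self.points_obj), 1))])
        return (homo @ self.T_wo.T)[:, :3]

    def update(self, T_wo_new):
        # Record the relative motion so it can be replayed as a prediction.
        self.H = T_wo_new @ np.linalg.inv(self.T_wo)
        self.T_wo = T_wo_new

    def predict(self):
        # Constant-motion model: replay the last inter-frame motion.
        return self.H @ self.T_wo
```

Motion priors such as CubeSLAM's nonholonomic vehicle model replace the generic `H` here with a constrained parameterization.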
3. Joint Optimization and Backend Architectures
Dynamic object SLAM systems generally employ a joint optimization or bundle adjustment backend where the state vector aggregates:
- Camera (or ego-platform) poses
- Static 3D feature points and/or background map structures
- Dynamic object poses (typically SE(3) trajectories), shapes, and associated dynamic point features
Objective functions are constructed as a sum of measurement and regularization terms. A representative form (illustrative rather than system-specific), over camera poses $X_k \in \mathrm{SE}(3)$, static landmarks $m_i$, object poses $L_j^k \in \mathrm{SE}(3)$, and object-frame points $q_{jp}$, is:
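$$
\min_{\{X_k\},\,\{m_i\},\,\{L_j^k\}}\;
\sum_{i,k} \bigl\| \pi\bigl(X_k^{-1} m_i\bigr) - z_{ik} \bigr\|_{\Sigma}^{2}
+ \sum_{j,k,p} \bigl\| \pi\bigl(X_k^{-1} L_j^k q_{jp}\bigr) - z_{jkp} \bigr\|_{\Sigma}^{2}
+ \sum_{j,k} \bigl\| \log\bigl((L_j^{k+1})^{-1} H_j L_j^{k}\bigr)^{\vee} \bigr\|_{\Lambda}^{2}
$$

Here $\pi(\cdot)$ is the camera projection, $z_{ik}$ and $z_{jkp}$ are image observations of static and object points, $H_j$ is a per-object motion prior (e.g., constant motion or a nonholonomic model), and $\Sigma$, $\Lambda$ are measurement and regularization weights.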
Specialized loss terms regularize object motions, favor rigidity between parts, encourage feature points to remain inside object boundaries, and constrain the temporal evolution of dynamic elements (Yang et al., 2018, Qiu et al., 2021, Li et al., 6 Jun 2025). When applied to explicit representations like Gaussian splats, color and depth rendering losses are combined and weighted differently for static and dynamic map components to suppress transient interference and occlusion artifacts (Li et al., 6 Jun 2025, Liu et al., 31 Aug 2025).
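For the splatting-based systems, a plausible form of such a weighted rendering loss (assumed here for illustration, not quoted from the cited papers) is

$$
\mathcal{L}_{\text{render}} = \sum_{p} w(p)\,\Bigl(\lambda_c \bigl\| \hat{C}(p) - C(p) \bigr\|_1 + \lambda_d \bigl\| \hat{D}(p) - D(p) \bigr\|_1 \Bigr),
$$

where $\hat{C}$, $\hat{D}$ are the rendered color and depth, $C$, $D$ the observed frame, and the per-pixel weight $w(p)$ downweights pixels flagged as dynamic when optimizing the static map (and conversely for the dynamic components).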
4. Treatment of Dynamic Features: Robustness Strategies
Handling dynamic observations is central. The literature demonstrates three broad strategies:
- Explicitly track and model dynamics, preserving dynamic points in the optimization by assigning them to object frames and enforcing inter-frame motion consistency (Yang et al., 2018, Zhang et al., 2020, Wadud et al., 2022, Li et al., 15 Mar 2025).
- Remove or inpaint dynamic regions pre-SLAM, e.g., via deep video inpainting guided by flow-based masks, and then apply static SLAM on cleaned frames (Uppala et al., 2023, Habibpour et al., 2 Oct 2025).
- Combine adaptive feature extraction, mask refinement via prior information (e.g., through recursive static background models or morphological corrections), and dynamic sampling to maintain optimization constraints despite the exclusion of dynamic regions (Liu et al., 31 Aug 2025).
These strategies yield a spectrum of solutions from full joint modeling to preemptive filtering, depending on task requirements and available computational resources.
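At the preemptive-filtering end of this spectrum, a minimal sketch looks as follows; the binary mask and keypoint array are assumed inputs, and the dilation step mirrors the morphological mask corrections mentioned above:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def filter_static_keypoints(keypoints, dynamic_mask, margin=5):
    """Drop keypoints on or near dynamic regions before handing them to a
    static SLAM front end. keypoints: (N, 2) array of (x, y) pixel
    coordinates; dynamic_mask: (H, W) bool; margin: dilation radius in
    pixels guarding against imperfect mask boundaries."""
    # Grow the mask so features sitting on object boundaries are rejected.
    grown = binary_dilation(dynamic_mask, iterations=margin)
    xs = np.clip(keypoints[:, 0].round().astype(int), 0, grown.shape[1] - 1)
    ys = np.clip(keypoints[:, 1].round().astype(int), 0, grown.shape[0] - 1)
    return keypoints[~grown[ys, xs]]
```

The trade-off is visible even here: every rejected keypoint is one fewer constraint for the optimizer, which is why some systems re-sample features or fall back to background models rather than simply discarding regions.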
5. Performance Evaluation and Benchmarks
Evaluation metrics vary according to the scope of dynamic handling:
- Camera trajectory error (ATE RMSE, RPE) in dynamic vs. static scenes (Yang et al., 2018, Krishna et al., 2023); see the alignment sketch after this list
- Object pose, trajectory, and velocity estimation accuracy (Henein et al., 2020, Wadud et al., 2022, Zhang et al., 2020)
- Dense map quality (e.g., Intersection over Union, DynaPSNR, SSIM, LPIPS for rendered views) (Li et al., 15 Mar 2025, Liu et al., 31 Aug 2025)
- Object segmentation recall and mAP (mean Average Precision) where segmentation modules are benchmarked (Wang et al., 2022, Krishna et al., 2023)
- Computational performance and real-time suitability (inference and mapping FPS) (Habibpour et al., 2 Oct 2025, Liu et al., 31 Aug 2025)
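As a point of reference, the ATE (RMSE) reported by these systems is typically computed after a closed-form rigid alignment of the estimated and ground-truth trajectories; a minimal sketch (Horn/Umeyama alignment without scale, assuming time-associated (N, 3) position arrays):

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute Trajectory Error (RMSE) between time-associated camera
    positions est, gt of shape (N, 3), after closed-form rigid alignment."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # Cross-covariance of the centered trajectories.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the least-squares rotation.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T               # rotation taking est onto gt
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```

Systems that track objects report the analogous errors over estimated object trajectories as well (cf. the object pose and velocity metrics above).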
Strong empirical results are reported on benchmark datasets such as KITTI, TUM RGB-D, BONN RGB-D, and indoor environments with large dynamic occlusions (Yang et al., 2018, Krishna et al., 2023, Li et al., 6 Jun 2025, Liu et al., 31 Aug 2025). Recent systems demonstrate both robust camera tracking and high-fidelity map reconstruction—dynamic objects are either clearly distinguished from static parts or their motion and structure are estimated for 3D scene understanding and prediction (Li et al., 15 Mar 2025, Li et al., 6 Jun 2025).
6. Applications, Implications, and Open Directions
Dynamic object SLAM systems have enabled:
- Robust navigation and obstacle avoidance by mobile robots/vehicles in crowded urban, indoor, and warehouse environments (Yang et al., 2018, Pfreundschuh et al., 2021, Tian et al., 2022)
- Photorealistic scene digitization and map editing in augmented/virtual reality, supporting dynamic content (Liu et al., 31 Aug 2025, Li et al., 15 Mar 2025)
- Accurate, online tracking and velocity estimation for agents in autonomous driving and surveillance (Henein et al., 2020, Tian et al., 2022)
- 4D (3D + time) scene reconstruction and advanced multi-agent interaction (Wang et al., 2020, Qiu et al., 2021)
Continued research emphasizes:
- Advancing dynamic object representations (e.g., from rigid bodies to articulated models to learning-based non-rigid structures) (Qiu et al., 2021)
- More sophisticated mask fusion, uncertainty modeling, and embedding of dynamic predictions within joint optimization backends (Li et al., 6 Jun 2025, Li et al., 15 Mar 2025)
- Improved computational scaling (GPU/CPU balance, real-time operation) and robustness to imperfect 2D/3D segmentation (Liu et al., 31 Aug 2025, Habibpour et al., 2 Oct 2025)
- Tight coupling between map rendering, object motion tracking, and semantic understanding for comprehensive scene modeling (Li et al., 15 Mar 2025, Krishna et al., 2023)
7. Summary Table of Representative Approaches
| System | Dynamic Object Modeling | Map Representation | Sensing Modality | Joint Optimization | Real-Time |
|---|---|---|---|---|---|
| CubeSLAM (Yang et al., 2018) | Cuboid + motion model | Sparse/cuboid map | Monocular camera | Yes | Yes |
| EM-Fusion (Strecke et al., 2019) | TSDF with EM data association | Dense SDF volumes | RGB-D | Yes | No |
| DymSLAM (Wang et al., 2020) | Geometric motion segmentation | Dense stereo + 4D map | Stereo camera | Yes | Yes |
| VDO-SLAM (Zhang et al., 2020) | SE(3) pose for objects via scene flow | Spatiotemporal map | Monocular/RGB-D | Yes | Yes |
| DL-SLOT (Tian et al., 2022) | Sliding window graph for all objects | LiDAR pose-graph | LiDAR | Yes | Yes |
| DynaGSLAM (Li et al., 15 Mar 2025) | Time-varying Gaussian splats | Photorealistic 3DGS | RGB-D | Decoupled | Yes |
| Dy3DGS-SLAM (Li et al., 6 Jun 2025) | Mask-fused dynamic suppression | Photorealistic 3DGS | Monocular RGB | Yes | Yes |
These systems collectively demonstrate the trajectory of dynamic object SLAM towards architectures that are robust to scene variability, support rich scene reconstructions, and facilitate advanced robotics and perception applications.