General Dynamic Scene Reconstruction
- General dynamic scene reconstruction is the process of creating temporally coherent 3D models from sensor inputs that capture both static and moving elements.
- Key methods include discrete models, continuous neural fields, and Gaussian splatting, balancing fidelity, scalability, and real-time performance.
- Advanced motion modeling, optimization strategies, and instance-level segmentation enable accurate representation of complex, deformable, and articulated scene dynamics.
General dynamic scene reconstruction refers to the recovery of temporally coherent, spatially accurate 3D representations of complex environments containing both static and dynamically moving elements, typically from image, video, or sensor input. This problem encompasses arbitrary scenes where geometry, appearance, and motion may change continuously over time, including rigid, articulated, and non-rigid transformations, in both indoor and outdoor environments. The domain is foundational to virtual and augmented reality, robotics, simulation, film, and embodied AI. Recent advances push reconstruction beyond rigid or template-constrained targets toward densely time-varying scenes with minimal prior assumptions about object category or camera motion.
1. Core Representation Paradigms
Several representation strategies underlie dynamic scene reconstruction, each with distinct trade-offs around fidelity, flexibility, and scalability:
- Discrete Models: Traditional methods employed dense point clouds, surfel fusion, meshes (including non-rigid surface deformation graphs), and voxel grids with temporal fusion such as TSDFs. While straightforward to integrate with classical SLAM pipelines, these representations are limited in scalability and in their ability to capture fine non-rigid deformations, especially in the presence of topology changes (Yunus et al., 2024).
- Continuous Neural Fields: Neural implicit representations have become dominant, especially neural radiance fields (NeRF), signed distance fields (SDFs), and neural occupancy fields. Dynamic extensions (e.g., D-NeRF, HyperNeRF) apply temporal modulation, learn spatio-temporal warps, or augment coordinates with hyper-dimensions to encode topology changes (Yunus et al., 2024, Chen et al., 2023). These methods handle topology transitions, material changes, and smooth deformations gracefully, at the expense of increased optimization complexity.
- Gaussian Splatting and Hybrid Models: Recent state-of-the-art methods utilize explicit 3D Gaussian primitives. Each “splat” is parameterized by position, covariance, color, and opacity, or by more complex time-varying attributes (Lin et al., 11 Jun 2025, Ma et al., 27 Jun 2025, Su et al., 10 Nov 2025); a minimal data-structure sketch follows this list. Dynamic scene variants extend this with motion modeling, instance labels, and inter-frame correspondences. These methods have emerged as highly scalable, efficient, and photorealistic volumetric representations capable of both real-time simulation and instance-level control.
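To make the splat parameterization concrete, the following minimal Python sketch shows one plausible per-primitive data structure with a naive linear motion term; the field names and the drift model are illustrative assumptions, not the layout of any cited method.

```python
import numpy as np

class DynamicGaussian:
    """Toy time-varying Gaussian primitive (field names are assumptions)."""
    def __init__(self, mean, cov, color, opacity, velocity):
        self.mean = np.asarray(mean, dtype=float)          # 3D center at t = 0
        self.cov = np.asarray(cov, dtype=float)            # 3x3 covariance
        self.color = np.asarray(color, dtype=float)        # RGB
        self.opacity = float(opacity)                      # in [0, 1]
        self.velocity = np.asarray(velocity, dtype=float)  # naive linear motion

    def mean_at(self, t):
        # Simplest possible dynamic extension: linear drift over time.
        return self.mean + t * self.velocity

    def density(self, x, t):
        # Unnormalized Gaussian falloff used as the splat's spatial weight.
        d = np.asarray(x, dtype=float) - self.mean_at(t)
        return self.opacity * np.exp(-0.5 * d @ np.linalg.inv(self.cov) @ d)

g = DynamicGaussian(mean=[0, 0, 1], cov=np.eye(3) * 0.01,
                    color=[1, 0, 0], opacity=0.9, velocity=[0.1, 0, 0])
print(g.density([0.05, 0, 1], t=0.5))   # weight of a nearby query point
```

Real systems replace the linear drift with the richer motion models of Section 2 and rasterize millions of such primitives with alpha compositing.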
2. Motion Modeling and Dynamic Decomposition
Dynamic scene reconstruction fundamentally involves disentangling static and moving elements and estimating their respective motions:
- Static-Dynamic Separation: Various strategies use geometric variance of Gaussian offsets (Deng et al., 11 Jun 2025), flow-field consistency (Deng et al., 11 Jun 2025, Su et al., 10 Nov 2025), predictive dynamic masks (Chen et al., 2 Dec 2025), or prior-based instance tracking (e.g., using 2D trackers and segmentation models (Su et al., 10 Nov 2025, Xie et al., 2024)). Explicit dual-memory networks further decouple stable static geometry and rapidly updating dynamic features (Cai et al., 11 Aug 2025).
- Motion Parameterization: Approaches include per-frame translation (scene flow) fields (Lin et al., 11 Jun 2025), learnable offset trajectories (e.g., Bézier curves (Ma et al., 27 Jun 2025)), hierarchical decomposition into coarse global and fine local (non-rigid) motion (Deng et al., 11 Jun 2025), and per-Gaussian velocity/lifespan attributes (Su et al., 10 Nov 2025). Motion can also be encoded via deformation fields coupling spatial coordinates and time, often supervised or regularized via flow/motion metrics extracted from the input; a toy trajectory parameterization is sketched after this list.
- Canonical Alignment and Deformation: Many pipelines define canonical object or actor coordinate spaces (e.g., for vehicles or pedestrians (Chen et al., 2024)), with per-frame deformations mapped via SE(3) (for rigid motion) or LBS/MLP-based techniques (for non-rigid phenomena). This enables physically grounded, temporally smooth reconstructions even in large-scale settings.
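As one concrete instance of a learnable offset trajectory, the sketch below evaluates a Bernstein-basis Bézier curve per Gaussian, in the spirit of BézierGS (Ma et al., 27 Jun 2025); the control-point layout and normalized-time convention are assumptions for exposition, not the paper's implementation.

```python
import numpy as np
from math import comb

def bezier_offset(control_points, t):
    """Evaluate an n-th order Bezier curve at normalized time t in [0, 1]."""
    P = np.asarray(control_points, dtype=float)   # (n+1, 3) learnable offsets
    n = len(P) - 1
    basis = np.array([comb(n, k) * (1 - t) ** (n - k) * t ** k
                      for k in range(n + 1)])     # Bernstein basis at t
    return basis @ P                              # (3,) offset to the canonical mean

# Four control points -> cubic trajectory; in training these points would be
# optimized jointly with the photometric and consistency losses of Section 3.
ctrl = [[0, 0, 0], [0.2, 0.1, 0], [0.4, 0.1, 0], [0.5, 0, 0]]
print(bezier_offset(ctrl, t=0.5))
```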
3. Training Objectives and Optimization
Dynamic scene methods combine several forms of supervision into multi-faceted training objectives:
- Photometric and Perceptual Losses: Core supervision is from photometric reconstruction between input and rendered images, often complemented by perceptual metrics (e.g., LPIPS, SSIM) (Lin et al., 11 Jun 2025, Ma et al., 27 Jun 2025).
- Depth and Flow Supervision: Depth priors (from LiDAR or depth sensors) and flow fields (from multi-view or monocular flow estimation) provide critical regularization, especially to resolve ambiguities inherent in monocular or sparsely calibrated setups (Lin et al., 11 Jun 2025, Xie et al., 2024, Wu et al., 3 Jul 2025).
- Specialized Losses: These include inter-curve consistency (to regularize motion paths (Ma et al., 27 Jun 2025)), instance-label and velocity/motion consistency (enforcing semantic and kinematic clustering (Su et al., 10 Nov 2025)), total variation on dynamic parameters (Deng et al., 11 Jun 2025), dynamic-only supervision (to ensure correct separation and attribution (Su et al., 10 Nov 2025, Ma et al., 27 Jun 2025)), and opacity/foreground constraints for correct ray termination (Deng et al., 11 Jun 2025). A toy objective combining such terms is sketched after this list.
- Efficient Optimization: Pipelines employ a mixture of end-to-end differentiable rendering (through splat-based rasterization or NeRF-like volumetric integrals), alternating coordinate descent (e.g., on surfaces/poses for LiDAR data (Chodosh et al., 2024)), and feedforward or transformer-based architectures for real-time scene prediction (Lin et al., 11 Jun 2025, Chen et al., 2 Dec 2025).
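The sketch below assembles the supervision terms above into a single toy objective; the L1/L2 forms and the weights are assumptions chosen for clarity, not any published method's exact loss.

```python
import numpy as np

def total_loss(rendered, target, depth_pred, depth_prior,
               flow_pred, flow_prior, dyn_params,
               w_photo=1.0, w_depth=0.1, w_flow=0.1, w_tv=0.01):
    photo = np.abs(rendered - target).mean()          # photometric L1
    depth = np.abs(depth_pred - depth_prior).mean()   # depth-prior regularizer
    flow = np.square(flow_pred - flow_prior).mean()   # flow consistency (L2)
    tv = np.abs(np.diff(dyn_params, axis=0)).mean()   # temporal total variation
    return w_photo * photo + w_depth * depth + w_flow * flow + w_tv * tv
```

In a real pipeline each term is computed on differentiably rendered batches, perceptual terms such as LPIPS are added on top, and the weights are tuned per dataset.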
4. Instance-Level and Semantic Reconstruction
Modern dynamic scene reconstruction targets not only global scene geometry but also fine-grained, instance-aware representations:
- Instance Segmentation and Tracking: Techniques exploit temporal inconsistencies for unsupervised instance discovery (Su et al., 10 Nov 2025), or use 2D mask propagation with 3D lifting and clustering (Lin et al., 11 Jun 2025, Li et al., 17 Oct 2025); a minimal mask-lifting sketch follows this list. Multi-object decomposition is supported in pipelines with asset-driven or semantic-aware deformation, combining 3D generation priors and data-driven segmentation (Biswas et al., 29 Nov 2025, Yunus et al., 2024).
- Object-Centric Parametrization: Per-object asset generation is achieved via high-fidelity neural mesh models or 3D latent generators, semantic-aware rigid and non-rigid transformations, and per-element temporal GS refinement (Biswas et al., 29 Nov 2025). Scene graphs explicitly encode node types (rigid, articulated, deformable), local canonical spaces, and their temporal evolution (Chen et al., 2024).
- Editing and Control: Explicit instance awareness enables downstream editing—removal/addition/re-animation of dynamic actors, per-object simulation, and actionable control in simulation environments (e.g., autonomous driving, robotics) (Ma et al., 27 Jun 2025, Chen et al., 2024).
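A minimal version of 2D-mask lifting might look as follows: project each Gaussian center into every frame, read the instance id under the projected pixel, and assign labels by majority vote. The pinhole model and voting rule are simplifying assumptions; real pipelines add occlusion reasoning and 3D clustering.

```python
import numpy as np

def lift_instance_labels(centers, intrinsics, poses, masks):
    """centers: (N,3) Gaussian means; poses: list of 4x4 world-to-camera
    matrices; masks: (T,H,W) array of nonnegative integer instance ids."""
    N = len(centers)
    votes = np.zeros((N, int(masks.max()) + 1), dtype=int)
    homog = np.concatenate([centers, np.ones((N, 1))], axis=1)  # (N,4)
    for pose, mask in zip(poses, masks):
        cam = (pose @ homog.T).T[:, :3]          # points in the camera frame
        pix = (intrinsics @ cam.T).T             # perspective projection
        z = np.maximum(pix[:, 2:3], 1e-9)        # guard the divide
        uv = np.round(pix[:, :2] / z).astype(int)
        H, W = mask.shape
        ok = (cam[:, 2] > 1e-6) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                                & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        votes[np.flatnonzero(ok), mask[uv[ok, 1], uv[ok, 0]]] += 1
    return votes.argmax(axis=1)                  # per-Gaussian instance label
```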
5. Input Modalities and Sensor Fusion
Input configurations are diverse and tailored to the domain:
- Multi-view Video and RGB-D: The gold standard for general dynamic scenes, facilitating calibration, scalable coverage, and temporal correspondences (Mustafa et al., 2015, Mustafa et al., 2019). Multi-view setups mitigate depth ambiguity and support robust dynamic object segmentation without background priors.
- Monocular Video: Ill-posed in general, but tractable under strong geometric/depth priors, specialized scene flow supervision, or heavy learning-based regularization (e.g., transformer pipelines, explicit dual-memory architectures (Lin et al., 11 Jun 2025, Cai et al., 11 Aug 2025, Xie et al., 2024)).
- LiDAR and Multimodal Inputs: Used for large-scale urban scenes, LiDAR provides dense 3D constraints for both background and dynamic objects. Synchronization, deskewing, and per-object tracking are critical (a toy deskewing step is sketched after this list); compositional optimization combines mesh/SDF modeling with pose registration (Chodosh et al., 2024).
- Bounding Boxes, Semantic Annotations, and SMPL Priors: Frequently leveraged for initialization, supervision, and canonicalization, especially for articulated humans and multi-agent interactions (Chen et al., 2024, Biswas et al., 29 Nov 2025). However, leading-edge pipelines increasingly aim for annotation-free, unsupervised (or weakly supervised) operation (Su et al., 10 Nov 2025).
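As a toy illustration of deskewing, the sketch below linearly interpolates the ego translation over the sweep and moves each return into the sweep-end frame; real pipelines also interpolate rotation (e.g., quaternion slerp), which is omitted here for brevity.

```python
import numpy as np

def deskew(points, timestamps, pos_start, pos_end, t_start, t_end):
    """points: (N,3) LiDAR returns in the sensor frame; timestamps: (N,)
    per-return capture times within one sweep; pos_*: (3,) ego translations."""
    s = (timestamps - t_start) / (t_end - t_start)        # normalized sweep time
    ego = pos_start + s[:, None] * (pos_end - pos_start)  # interpolated ego position
    # Re-express each point relative to the sensor pose at sweep end.
    return points + (ego - pos_end)
```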
6. Algorithmic Pipelines and Time/Performance Characteristics
Pipelines exhibit significant diversity in algorithmic flow and real-time capacity:
| Class | Example Method | Performance/Notes |
|---|---|---|
| Offline Optimization | General Multi-view SfM/GS (Mustafa et al., 2015, Mustafa et al., 2019) | Robust but slow (minutes per frame or more); accurate segmentation and mesh fusion. |
| End-to-end Neural Fields | DySurf (Chen et al., 2023), DRSM (Xie et al., 2024) | High fidelity and template-free, but training takes hours. |
| Feedforward/Transformer | DGS-LRM (Lin et al., 11 Jun 2025), DGGT (Chen et al., 2 Dec 2025) | Real-time feedforward inference (~0.5 s per scene, about 2 FPS, on an A100); scalable; enables rapid digital twinning. |
| Grid/Plane Factorization | DRSM (Xie et al., 2024), LocalDyGS (Wu et al., 3 Jul 2025) | Efficient for stationary cameras or highly dynamic local regions; converges in minutes. |
| Compositional/Hybrid | SMORE (Chodosh et al., 2024), ADSR (Biswas et al., 29 Nov 2025) | Modular; fuses classical and learned components to maximize robustness under occlusion and partial/unknown object sets. |
Notably, Gaussian-splatting approaches routinely achieve >60 Hz rendering and sub-minute or even real-time optimization (given sufficient GPU resources) (Chen et al., 2024, Ma et al., 27 Jun 2025, Wu et al., 3 Jul 2025).
7. Evaluation, Limitations, and Prospects
Dynamic scene reconstruction is benchmarked using photometric (PSNR, SSIM, LPIPS), geometric (Chamfer, F-score, accuracy/completeness), pose (ATE, RPE), and semantic/instance metrics. SOTA methods report marked improvements on Waymo, nuPlan, KITTI, HOI-M3, and synthetic multi-human/object datasets (Ma et al., 27 Jun 2025, Biswas et al., 29 Nov 2025).
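For reference, minimal implementations of two of these metrics are sketched below; the symmetric mean nearest-neighbor form of the Chamfer distance shown here is one common convention among several.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (Na,3) and b: (Nb,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

print(psnr(np.full((4, 4), 0.5), np.full((4, 4), 0.52)))
print(chamfer(np.random.rand(100, 3), np.random.rand(120, 3)))
```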
Known Limitations and Open Challenges (Yunus et al., 2024):
- Monocular ambiguity and under-constrained optimization in the absence of depth or multi-view constraints.
- Occlusion handling and topology change (e.g., rapid contacts, disassembly) remain open.
- Data efficiency, real-time scaling, and memory management for long, large scenes.
- Illumination/material intrinsic decomposition lags behind static-scene modeling.
- Compositionality and multi-agent interaction: instance-level modeling of fine-grained object- and part-level dynamics in generic scenes is still in its early stages.
- Self-supervision and generalization: Removing the need for expensive annotation, integrating vision-language priors, and scaling to open-world scenarios are active research frontiers.
Promising directions include Instant-NGP-style hash encodings for neural-field acceleration, hybrid neural-explicit scene graphs, self-supervised part/object segmentation and dynamic loop closure, and bridging physical simulation for causal, controllable, real-time digital twins.
References
- DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos (Lin et al., 11 Jun 2025)
- Proactive Scene Decomposition and Reconstruction (Li et al., 17 Oct 2025)
- DIAL-GS: Dynamic Instance Aware Reconstruction for Label-free Street Scenes with 4D Gaussian Splatting (Su et al., 10 Nov 2025)
- BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting (Ma et al., 27 Jun 2025)
- DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction (Deng et al., 11 Jun 2025)
- Asset-Driven Semantic Reconstruction of Dynamic Scene with Multi-Human-Object Interactions (Biswas et al., 29 Nov 2025)
- LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling (Wu et al., 3 Jul 2025)
- DRSM: Efficient Neural 4D Decomposition for Dynamic Reconstruction in Stationary Monocular Cameras (Xie et al., 2024)
- OmniRe: Omni Urban Scene Reconstruction (Chen et al., 2024)
- SMORE: Simultaneous Map and Object REconstruction (Chodosh et al., 2024)
- Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction (Cai et al., 11 Aug 2025)
- Recent Trends in 3D Reconstruction of General Non-Rigid Scenes (Yunus et al., 2024)