LocalDyGS: Dynamic Scene Reconstruction
- LocalDyGS is a dynamic 3D reconstruction framework that partitions complex scenes into localized spaces to capture both fine-scale and large-scale motions.
- It decouples static and dynamic features within each local space and fuses them adaptively into temporal Gaussian parameterizations to improve efficiency and accuracy.
- Adaptive seed growing and multi-view SfM integration enable robust 3D modeling for applications in AR/VR, gaming, and dynamic scene analysis.
LocalDyGS is a framework for dynamic scene reconstruction from multi-view video, designed to accurately and efficiently model both fine-scale and large-scale motions in highly dynamic real-world scenes. The central contribution is an adaptive decomposition of the global scene into multiple local spaces, each represented by decoupled static and dynamic features, which are fused to generate time-varying Temporal Gaussians as rendering primitives. This approach makes it possible to reconstruct complex, temporally evolving motion for arbitrary viewpoints, overcoming key limitations of prior neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) methods, especially in cases involving large-scale dynamic scenes (2507.02363).
1. Local Space Decomposition
LocalDyGS partitions the entire dynamic scene into a collection of local spaces, each defined by a seed point. Seeds are initialized by fusing Structure-from-Motion (SfM) point clouds collected from multiple frames, so that seeds are also distributed in regions where dynamic objects appear. Each seed marks the center of a local space covering a spatial neighborhood whose extent is controlled by a learned scale parameter. This decomposition reduces the global dynamic modeling problem to multiple localized subproblems, in which motion boundaries and fine details can be better captured within each local context. The number and placement of seeds directly control the framework's capacity to represent complex, multi-scale dynamics across the scene.
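A minimal sketch of the seed-initialization step described above, assuming per-frame SfM point clouds are available as NumPy arrays and using a simple voxel-grid merge; the function name and voxel size are illustrative, not taken from the paper:

```python
import numpy as np

def initialize_seeds(sfm_point_clouds, voxel_size=0.05):
    """Fuse per-frame SfM point clouds into a single set of seed positions.

    sfm_point_clouds: list of (N_i, 3) arrays, one per sampled frame.
    voxel_size: edge length of the merging voxel grid (illustrative value).
    Returns an (M, 3) array of seed centers, one per occupied voxel.
    """
    points = np.concatenate(sfm_point_clouds, axis=0)            # stack all frames
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)   # quantize to voxels
    # Keep one representative point per voxel so seeds cover both static
    # and dynamic regions without dense duplication.
    _, unique_rows = np.unique(voxel_idx, axis=0, return_index=True)
    return points[np.sort(unique_rows)]

# Each seed defines a local space; its spatial extent is governed by a learned,
# per-seed scale parameter (initialized here to a constant for illustration).
# seeds = initialize_seeds([cloud_t0, cloud_t1, cloud_t2])
# scales = np.full((seeds.shape[0],), 0.1, dtype=np.float32)
```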
2. Decoupling Static and Dynamic Features
A distinctive aspect of the LocalDyGS framework is the explicit separation (decoupling) of static and dynamic representations within each local space. For any seed:
- A static feature is learned and shared across all time steps, encoding the time-invariant geometric and radiance properties of the scene near the seed.
- A dynamic residual feature is provided by a global four-dimensional hash-encoded residual field (space-time), which captures time-specific changes at that seed.
The two feature streams are fused at each sampling time $t$ via an adaptively weighted sum, where the weights $w_s$ and $w_d$ are predicted by a shallow MLP conditioned on the seed position and query time:

$f(t) = w_s \, f_{\text{static}} + w_d \, f_{\text{dyn}}(t)$

This fusion allows the dynamic component to focus selectively on changes, greatly reducing redundancy when large portions of the scene remain static over time.
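A compact sketch of this decoupled-feature fusion in PyTorch; the layer widths, the stand-in for the 4D hash-encoded residual field, and the softmax weight MLP are placeholder choices rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LocalFeatureFusion(nn.Module):
    """Fuse a per-seed static feature with a time-dependent dynamic residual."""

    def __init__(self, num_seeds, feat_dim=32):
        super().__init__()
        # Time-invariant feature per seed (shared across all time steps).
        self.static_feat = nn.Parameter(torch.randn(num_seeds, feat_dim) * 0.01)
        # Stand-in for the 4D (space-time) hash-encoded residual field.
        self.hash_residual = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, feat_dim)
        )
        # Shallow MLP predicting the two fusion weights from (position, time).
        self.weight_mlp = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1)
        )

    def forward(self, seed_xyz, t):
        # seed_xyz: (num_seeds, 3) seed positions; t: float time in [0, 1].
        t_col = torch.full((seed_xyz.shape[0], 1), t, device=seed_xyz.device)
        xyzt = torch.cat([seed_xyz, t_col], dim=-1)
        f_dyn = self.hash_residual(xyzt)      # time-specific residual per seed
        w = self.weight_mlp(xyzt)             # (num_seeds, 2) adaptive weights
        return w[:, :1] * self.static_feat + w[:, 1:] * f_dyn
```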
3. Temporal Gaussian Parameterization
Within each local space, motion is modeled using a set of Temporal Gaussians whose parameters are predicted from the fused feature $f(t)$. For each Gaussian, essential rendering parameters are produced via shallow MLPs:
- Mean: $\mu = x_s + \mathrm{MLP}_{\mu}(f(t))$, where $x_s$ is the seed position and $\mathrm{MLP}_{\mu}$ is a small MLP predicting an offset
- Opacity: $\alpha = \mathrm{MLP}_{\alpha}(f(t), \mathbf{d})$, with $\mathbf{d}$ indicating the viewing direction
- Scale, rotation, and color: produced analogously via dedicated predictors
Temporal Gaussians are activated only during time intervals corresponding to local motion. If the predicted opacity falls below an opacity threshold, the corresponding Gaussian is pruned automatically, improving computational efficiency while maintaining fidelity. Unlike methods that seek to optimize continuous 4D trajectories, LocalDyGS does not attempt to reconstruct long-term trajectories for every point; instead, it dynamically generates local Gaussians for each time step, which better handles complicated, rapidly varying motion.
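A sketch of how per-Gaussian rendering parameters could be decoded from the fused feature, following the parameterization above; the choice of `k` Gaussians per seed, the head widths, and the activation functions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TemporalGaussianDecoder(nn.Module):
    """Decode k Temporal Gaussians per local space from the fused feature f(t)."""

    def __init__(self, feat_dim=32, k=4):
        super().__init__()
        self.k = k
        # One shallow head per attribute group, as described in the text.
        self.offset_head  = nn.Linear(feat_dim, k * 3)   # mean = seed + offset
        self.opacity_head = nn.Linear(feat_dim + 3, k)   # conditioned on view direction
        self.scale_head   = nn.Linear(feat_dim, k * 3)
        self.rot_head     = nn.Linear(feat_dim, k * 4)   # quaternion per Gaussian
        self.color_head   = nn.Linear(feat_dim, k * 3)

    def forward(self, fused_feat, seed_xyz, view_dir, opacity_min=0.01):
        n = fused_feat.shape[0]
        mean    = seed_xyz.unsqueeze(1) + self.offset_head(fused_feat).view(n, self.k, 3)
        opacity = torch.sigmoid(
            self.opacity_head(torch.cat([fused_feat, view_dir], dim=-1)))   # (n, k)
        scale   = torch.exp(self.scale_head(fused_feat)).view(n, self.k, 3)
        rot     = torch.nn.functional.normalize(
            self.rot_head(fused_feat).view(n, self.k, 4), dim=-1)
        color   = torch.sigmoid(self.color_head(fused_feat)).view(n, self.k, 3)
        # Gaussians whose opacity falls below the threshold stay inactive for
        # this time step and are skipped during rasterization.
        active = opacity > opacity_min                                       # (n, k) mask
        return mean, opacity, scale, rot, color, active
```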
4. Adaptive Seed Growing
To ensure spatial completeness, LocalDyGS incorporates an Adaptive Seed Growing (ASG) mechanism. During optimization, additional seeds are injected in regions where the 2D projection gradients of the current reconstruction (i.e., reprojection error) exceed a preset threshold. Newly added seeds supplement the initial SfM-based cloud, improving coverage in under-represented or occluded regions and enabling the system to refine itself adaptively during training.
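A simplified sketch of this growing rule, assuming accumulated screen-space gradient magnitudes per seed are available from the optimizer; the gradient-accumulation details, threshold value, and new-seed placement rule are illustrative:

```python
import torch

def grow_seeds(seeds, scales, grad_accum, grad_threshold=0.0002):
    """Add new seeds where accumulated 2D projection gradients are large.

    seeds:      (N, 3) current seed positions.
    scales:     (N,)   learned per-seed scale parameters.
    grad_accum: (N,)   accumulated screen-space gradient magnitude per seed.
    Returns the augmented seed and scale tensors.
    """
    needs_growth = grad_accum > grad_threshold
    if not needs_growth.any():
        return seeds, scales
    # Place a new seed near each under-reconstructed region by perturbing the
    # existing seed within its learned local extent (illustrative placement).
    parents = seeds[needs_growth]
    parent_scales = scales[needs_growth]
    new_seeds = parents + torch.randn_like(parents) * parent_scales.unsqueeze(-1)
    seeds = torch.cat([seeds, new_seeds], dim=0)
    scales = torch.cat([scales, parent_scales.clone()], dim=0)
    return seeds, scales
```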
5. Implementation and Training Pipeline
The LocalDyGS system processes synchronized multi-view video frames as follows:
- Seed Initialization: Aggregate SfM point clouds from multiple frames to place initial seeds.
- Local Feature Learning: For each seed, learn a static feature vector and initialize a spatially localized 4D hash field for dynamic residuals.
- Gaussian Parameter Prediction: For each local space and each time, compute and decode the parameters for Temporal Gaussians.
- Rendering: Project all active Gaussians to each camera view at every time step and aggregate their contributions for image synthesis.
- Optimization: Jointly train all learnable modules (static features, hash fields, weight field MLP, Temporal Gaussian decoders) with photometric loss between rendered and observed images, along with regularization terms for sparsity (opacity thresholding).
- Seed Growing: Monitor the 2D projection error and add new seeds where needed during training.
This modular pipeline enables scalable, efficient, and adaptive modeling. Training and inference can be accelerated due to the natural local-to-global parallelization structure and pruning of inactive Gaussians.
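A high-level sketch of one training iteration tying the pieces above together; the `render_fn` rasterizer, the loss weighting, and the sparsity term are assumptions reusing the illustrative components from the earlier snippets, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def training_step(fusion, decoder, seeds, view_dirs, render_fn, camera,
                  image_gt, t, optimizer, lambda_sparsity=0.01):
    """One optimization step: fuse features, decode Gaussians, render, backprop.

    render_fn: differentiable 3DGS-style rasterizer taking
               (means, opacities, scales, rotations, colors, camera) -> image.
    """
    optimizer.zero_grad()

    fused = fusion(seeds, t)                            # per-seed features at time t
    mean, opacity, scale, rot, color, active = decoder(fused, seeds, view_dirs)

    # Render only the Gaussians that are active at this time step.
    image_pred = render_fn(mean[active], opacity[active], scale[active],
                           rot[active], color[active], camera)

    photometric = F.l1_loss(image_pred, image_gt)       # rendered vs. observed frame
    sparsity = opacity.mean()                           # illustrative sparsity regularizer
    loss = photometric + lambda_sparsity * sparsity
    loss.backward()
    optimizer.step()
    return loss.item()
```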
6. Empirical Performance and Comparison
LocalDyGS demonstrates state-of-the-art reconstruction quality on both fine-scale motion benchmarks (N3DV, MeetRoom) and large-scale dynamic scenes (e.g., a basketball court dataset, VRU). Metrics such as PSNR, perceptual distance (LPIPS/DSSIM), frames-per-second (FPS), and storage usage show that LocalDyGS achieves:
- Sharper and more temporally consistent novel view synthesis compared to prior NeRF or 3D Gaussian-based methods.
- Higher efficiency, with a total model size (e.g., ~100 MB) lower than that of many competing approaches.
- Reduced training time and faster convergence.
In large-scale scenes with highly nonrigid motion, previous methods that rely on global trajectory optimization or dense radiance fields often fail or require extreme computational resources, while LocalDyGS remains efficient due to its local space decomposition and temporal Gaussian mechanism. On static or slowly varying backgrounds, the method allocates resources efficiently by relying primarily on the static feature stream and pruning inactive Gaussians.
7. Applications and Future Directions
LocalDyGS is well-suited for applications requiring accurate, temporally resolved, and efficient 3D scene reconstruction:
- Free-viewpoint video for immersive events or AR/VR, particularly where dynamic actors or large-scale motion present challenges for conventional radiance field or splatting approaches.
- Real-time or near-real-time dynamic scene capture for gaming, visual effects, or robotics, where parallel inference and efficient representation are required.
- Scenarios where only multi-view synchronized video is available and no dense 3D measurements can be obtained.
Possible future work includes developing new geometric priors for initializing seeds and local spaces in more challenging settings (e.g., from monocular video or with imperfect SfM results), extending the approach to real-time online optimization, and further compressing the Temporal Gaussian representation for storage- and latency-sensitive deployments.
In summary, LocalDyGS advances the modeling of dynamic 3D scenes by combining local decomposition, feature decoupling, and time-dependent adaptive rendering primitives, resulting in both high accuracy and scalable computational performance for a wide range of dynamic reconstruction tasks (2507.02363).