- The paper presents a fully automated pipeline that enhances dynamic scene reconstruction by refining segmentation masks, depth estimates, and point tracks.
- It integrates novel loss functions, including virtual-view depth and scaffold-projection losses, to improve geometric coherence and suppress artifacts.
- Quantitative results show improved PSNR, SSIM, and LPIPS metrics over existing methods, demonstrating robustness in diverse, in-the-wild scenes.
Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video
Introduction
This paper presents a fully automated pipeline for dynamic scene reconstruction from monocular RGB videos, focused on improving the quality of foundation model priors integrated into Dynamic Gaussian Splatting (DGS). Gaussian Splatting has established itself as an efficient, point-based scene representation capable of capturing static and dynamic geometry with real-time rasterization. The proposed method enhances depth, masks, and point tracks sourced from vision foundation models to overcome critical bottlenecks in dynamic scene fidelity—particularly for thin structures and complex object motions—without inventing a new scene representation.
Method Overview
The approach follows a three-stage pipeline: (I) Initialization, (II) Lifting to 3-D, and (III) Dynamic Scene Reconstruction. Initialization extracts dynamic object segmentation masks using video segmentation fused with epipolar error maps, enabling more precise identification of moving object regions. The masks guide depth refinement and skeleton-based sampling for robust 2-D trajectory extraction, including mask-guided re-identification for occlusion recovery. These higher-quality priors are embedded into the reconstruction process via a virtual-view depth loss to prevent floaters and a scaffold-projection loss for geometric coherence.
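To make the data flow between the three stages concrete, here is a minimal container for the per-frame priors that Stage I produces and the later stages consume. The field names and shapes are illustrative assumptions, not the authors' interface.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FramePriors:
    """Per-frame enhanced priors passed from Stage I into Stages II-III.

    Shapes assume an H x W frame with N query points tracked through the video.
    """
    rgb: np.ndarray        # (H, W, 3) input frame
    dyn_mask: np.ndarray   # (H, W) bool dynamic-object mask (segmentation ∩ epipolar error)
    depth: np.ndarray      # (H, W) refined, video-consistent depth
    tracks_2d: np.ndarray  # (N, 2) skeleton-sampled track positions in this frame
    track_vis: np.ndarray  # (N,) bool visibility after mask-guided re-identification
```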
Figure 1: The three-stage pipeline enabling high-fidelity monocular dynamic scene reconstruction via prior enhancement and novel supervisory objectives.
Foundation Model Prior Enhancement
Dynamic Object Mask Selection
By intersecting epipolar (EPI) error masks with video segmentation, the method reliably isolates dynamic objects, removing static distractors (e.g., shadows) via dual-threshold filtering. This robust masking enables targeted depth and track supervision.
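A minimal NumPy sketch of this selection step follows. The Sampson distance is a standard score for the epipolar constraint (in practice the per-pixel error map would come from dense optical-flow correspondences); the dual thresholds (a per-pixel error cutoff and a per-segment inlier fraction) and their values are illustrative assumptions about how the filtering could be implemented, and `F` is the fundamental matrix estimated between the two frames.

```python
import numpy as np

def sampson_epipolar_error(pts1, pts2, F):
    """Per-correspondence Sampson distance to the epipolar constraint x2^T F x1 = 0."""
    x1 = np.concatenate([pts1, np.ones((len(pts1), 1))], axis=1)   # (N, 3) homogeneous
    x2 = np.concatenate([pts2, np.ones((len(pts2), 1))], axis=1)
    Fx1 = x1 @ F.T                                                 # rows are (F x1_i)^T
    Ftx2 = x2 @ F                                                  # rows are (F^T x2_i)^T
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / (den + 1e-8)

def select_dynamic_segments(epi_error, segments, tau_pix=1.0, tau_frac=0.3):
    """Keep a video segment only if enough of its pixels violate the epipolar constraint.

    epi_error: (H, W) per-pixel error map, segments: (H, W) integer segment ids.
    The two thresholds stand in for the paper's dual-threshold filtering.
    """
    dynamic = np.zeros_like(segments, dtype=bool)
    for seg_id in np.unique(segments):
        if seg_id == 0:                                   # background / unlabeled
            continue
        in_seg = segments == seg_id
        if (epi_error[in_seg] > tau_pix).mean() > tau_frac:
            dynamic |= in_seg
    return dynamic
```

Scoring whole segments rather than raw pixels is what lets low-error static distractors such as shadows drop out, even when a few of their pixels have noisy flow.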
Figure 2: Dynamic mask selection via EPI error mask and segment intersection, ensuring only salient dynamic object regions are retained for reconstruction.
Depth Refinement
Mono-depth estimates (from MoGe) are aligned to the temporally consistent video depth; an object-depth loss is introduced to sharpen dynamic object detail within the mask. This correction recovers thin, mobile structures and maintains global scene alignment, enhancing fidelity for both geometry and appearance.
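A minimal PyTorch sketch of the alignment and the masked object-depth term, assuming the alignment is a per-frame scale-and-shift least-squares fit and the object term is an L1 penalty inside the dynamic mask; the paper's exact formulation may differ.

```python
import torch

def align_scale_shift(d_mono, d_video, valid):
    """Least-squares scale/shift that maps MoGe mono-depth onto the video-depth frame."""
    x = d_mono[valid].reshape(-1)
    y = d_video[valid].reshape(-1)
    A = torch.stack([x, torch.ones_like(x)], dim=1)          # (M, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution     # (2, 1): [scale, shift]
    return sol[0, 0] * d_mono + sol[1, 0]

def object_depth_loss(d_refined, d_mono_aligned, dyn_mask):
    """L1 term pulling refined depth toward the sharp aligned mono-depth inside the mask."""
    return (d_refined[dyn_mask] - d_mono_aligned[dyn_mask]).abs().mean()
```

The global fit keeps the scene consistent over time, while the masked term preserves the mono-depth detail that temporal smoothing tends to wash out on thin moving structures.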
Figure 3: Object-specific depth refinement recovers thin dynamic structures that are lost by standard depth estimation approaches.
Mask-Guided Point Tracking
Skeletons extracted from dynamic masks and densely sampled along medial axes yield more informative 2-D tracks, particularly across challenging regions (e.g., limbs). Mask-guided re-identification recovers tracks lost to occlusion by leveraging object mask consistency. A self-occlusion filter rejects false track re-entries based on trajectory deviation.
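A sketch of the skeleton-based query sampling, using scikit-image's medial-axis transform as a stand-in for the paper's skeletonization; the stride and the simple re-identification check are illustrative.

```python
import numpy as np
from skimage.morphology import medial_axis

def skeleton_query_points(dyn_mask, stride=4):
    """Sample 2-D track queries along the medial axis of a dynamic-object mask.

    Unlike uniform sampling over the mask, this keeps thin parts such as limbs
    covered; `stride` subsamples skeleton pixels for efficiency.
    """
    skel = medial_axis(dyn_mask.astype(bool))
    ys, xs = np.nonzero(skel)
    return np.stack([xs, ys], axis=1)[::stride]      # (M, 2) pixel coordinates

def reid_candidate_frame(mask_area_per_frame, t_lost, min_area=50):
    """First frame after an occlusion at t_lost where the object's mask reappears.

    A simplistic proxy for mask-guided re-identification: once the segment is
    visible again, lost tracks can be re-queried inside it.
    """
    for t in range(t_lost + 1, len(mask_area_per_frame)):
        if mask_area_per_frame[t] >= min_area:
            return t
    return None
```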
Figure 4: Skeleton-based track sampling ensures dense coverage of complex structures, mitigating the undersampling seen with conventional mask-based sampling.
Figure 5: Mask-guided track re-identification bridges gaps caused by occlusion, improving overall dynamic scene coverage.
3-D Lifting and Motion Scaffolding
Dynamic pixels are back-projected to initialize 3-D Gaussians; long-range tracks are promoted to sparse motion-scaffold nodes. Spatial and temporal optimization regularize the initial scene estimate, providing coherent starting points for downstream refinement.
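A minimal NumPy sketch of the back-projection used to seed Gaussians from masked dynamic pixels; the pinhole model and the camera-to-world convention are assumptions, and promoting long-range tracks to scaffold nodes would apply the same lift to the tracked pixels.

```python
import numpy as np

def backproject_dynamic_pixels(depth, dyn_mask, K, c2w):
    """Lift masked pixels to world-space 3-D points that initialize Gaussians.

    depth: (H, W) refined depth, dyn_mask: (H, W) bool, K: (3, 3) intrinsics,
    c2w: (4, 4) camera-to-world pose (pinhole model assumed).
    """
    ys, xs = np.nonzero(dyn_mask)
    z = depth[ys, xs]                                              # (N,)
    pix = np.stack([xs, ys, np.ones_like(xs)]).astype(np.float64)  # (3, N) homogeneous
    cam = (np.linalg.inv(K) @ pix) * z                             # rays scaled by depth
    world = c2w[:3, :3] @ cam + c2w[:3, 3:4]                       # rotate + translate
    return world.T                                                 # (N, 3)
```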
Reconstruction Supervision with Novel Objectives
The main supervision signals (appearance, depth, track loss) propagate prior-derived constraints onto both the Gaussian cloud and the underlying motion scaffolds. Two critical new losses—virtual-view depth and scaffold-projection—directly address consistent dynamic geometry and floating artifact suppression.
Virtual-View Depth Loss
By generating synthetic views via camera extrinsic perturbation, a depth loss on these virtual viewpoints regularizes the scene to prevent view-dependent floaters and overfitting to training perspectives. This significantly reduces out-of-distribution artifacts for novel view synthesis.
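A sketch of one plausible implementation: jitter the training camera's extrinsics to obtain a nearby virtual view, render depth there, and compare it with a reference depth warped into that view. The perturbation magnitudes, the renderer handle `render_depth_fn`, and the choice of an L1 penalty are illustrative assumptions.

```python
import torch

def perturb_extrinsics(c2w, max_trans=0.05, max_rot_deg=2.0):
    """Jitter a camera-to-world pose to obtain a nearby virtual viewpoint."""
    t_noise = (torch.rand(3) * 2 - 1) * max_trans
    w = (torch.rand(3) * 2 - 1) * torch.deg2rad(torch.tensor(max_rot_deg))
    W = torch.zeros(3, 3)                        # skew-symmetric matrix of the axis-angle w
    W[0, 1], W[0, 2] = -w[2], w[1]
    W[1, 0], W[1, 2] = w[2], -w[0]
    W[2, 0], W[2, 1] = -w[1], w[0]
    R_noise = torch.matrix_exp(W)                # rotation via the matrix exponential
    c2w_v = c2w.clone()
    c2w_v[:3, :3] = R_noise @ c2w[:3, :3]
    c2w_v[:3, 3] += t_noise
    return c2w_v

def virtual_view_depth_loss(render_depth_fn, gaussians, c2w, ref_depth_virtual):
    """Penalize rendered depth in the virtual view against a warped reference depth.

    `render_depth_fn(gaussians, pose)` stands in for the splatting renderer's
    depth pass; `ref_depth_virtual` is the prior depth reprojected into the view.
    """
    c2w_v = perturb_extrinsics(c2w)
    d_virtual = render_depth_fn(gaussians, c2w_v)
    valid = ref_depth_virtual > 0
    return (d_virtual[valid] - ref_depth_virtual[valid]).abs().mean()
```

Because the perturbations stay small, the warped reference depth remains a reasonable supervision target while still exposing floaters that only look correct from the original training viewpoint.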

Figure 6: Virtual-view depth loss removes floaters associated with limited baseline training views, preserving clean geometry under viewpoint shift.
Scaffold-Projection Loss
The scaffold-projection loss anchors scaffold nodes to projected 2-D tracks, preventing drift (especially for thin, fast-moving segments) otherwise unaddressed by ARAP regularization.
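A minimal PyTorch version of how such a projection term could look, with an L1 penalty between each visible node's pinhole projection and its associated 2-D track; the conventions (world-to-camera pose, intrinsics K) and the L1 choice are assumptions.

```python
import torch

def scaffold_projection_loss(nodes_3d, tracks_2d, visible, K, w2c):
    """Anchor motion-scaffold nodes at time t to their observed 2-D tracks.

    nodes_3d: (N, 3) world-frame node positions, tracks_2d: (N, 2) pixel tracks,
    visible: (N,) bool visibility, K: (3, 3) intrinsics, w2c: (4, 4) extrinsics.
    """
    homog = torch.cat([nodes_3d, torch.ones_like(nodes_3d[:, :1])], dim=1)  # (N, 4)
    cam = (w2c @ homog.T)[:3]                          # (3, N) camera-frame points
    pix = K @ cam                                      # perspective projection
    pix = (pix[:2] / pix[2:].clamp(min=1e-6)).T        # (N, 2) projected pixels
    return (pix[visible] - tracks_2d[visible]).abs().mean()
```

Only visible tracks contribute, so occluded nodes remain governed by the ARAP and temporal regularizers rather than by unreliable 2-D evidence.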
Figure 7: Scaffold-projection loss maintains track-to-scaffold consistency, curbing spatial drift in thin or highly articulated regions.
Experimental Results
Quantitative Comparisons
On DyCheck [gao2022monocular], the method outperforms baseline monocular DGS systems Shape of Motion [wang2024shape] and MoSca [lei2024mosca] across PSNR, SSIM, and LPIPS metrics in both pose-free RGB and depth+pose regimes. On NVIDIA multi-view videos, results are marginally higher than MoSca and clearly better than RoDynRF [liu2023robust].
Ablation Studies
Mask-guided point tracking and re-ID, scaffold-projection loss, and virtual-view depth regularization each contribute to geometric and appearance improvement, especially for thin moving regions and occlusion-heavy sequences.

Figure 8: Comparative view synthesis on the iPhone dataset; the proposed method preserves texture and geometry in challenging dynamic regions compared to baselines.
Figure 9: Skeleton sampling increases fidelity in thin regions compared to uniform EPI error mask sampling.
Figure 10: Track re-identification avoids the occlusion-induced failures seen with baseline trackers.
Figure 11: Scaffold-projection loss ablation—critical for maintaining spatial correspondence in articulated regions.
Figure 12: Local depth refinement further improves fine structure preservation.
Figure 13: Depth refinement markedly improves novel view synthesis, restoring true object geometry.
Figure 14: Virtual-view depth loss ablation; including the loss yields cleaner, artifact-free novel viewpoints.
Implications and Future Directions
The results demonstrate the significant impact of leveraging and refining foundation model outputs, specifically segmentation, depth estimation, and point tracking, rather than solely elaborating scene representations. The pipeline achieves state-of-the-art results on dynamic monocular video, fully automatically, and remains robust on challenging in-the-wild sequences (DAVIS). Practically, it advances AR/VR content workflows where depth sensors and calibrated camera rigs are often unavailable.
Theoretically, the approach highlights the bottleneck shift from physical representations to prior quality and supervisory signal alignment in contemporary dynamic reconstruction. Future development in text/image/video foundation models and tracking will directly carry over to further gains in scene fidelity. Integrating generative video models may address present limitations in handling motion blur and unseen regions.
Conclusion
This work demonstrates that targeted enhancement of segmentation, depth, and track priors—combined with loss terms designed to propagate these improvements—substantially advances the performance envelope of monocular dynamic scene reconstruction with Gaussian Splatting. The approach is modular, scalable, and readily extensible to new vision priors, setting the stage for further improvements in both algorithmic sophistication and practical deployability.
Figure 15: Qualitative improvements in DAVIS dataset synthesis; proposed pipeline robustly reconstructs dynamic scenes with fine detail and background integrity.