- The paper introduces a novel method, Dynamic Gaussian Marbles, that reconstructs dynamic scenes from casual monocular videos using isotropic Gaussians.
- It employs a hierarchical divide-and-conquer strategy: the video is split into subsequences whose Gaussian trajectories are optimized independently and then merged into a globally aligned representation.
- Evaluation on the Nvidia Dynamic Scenes and DyCheck iPhone datasets shows improved perceptual quality and tracking accuracy over previous Gaussian-based methods.
Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos
The paper "Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos" presents a novel approach for reconstructing dynamic scenes from monocular video inputs using a representation called Dynamic Gaussian Marbles (DGMarbles). This research addresses the challenge of novel view synthesis from casually captured monocular videos, a problem that extends current 3D reconstruction capabilities beyond controlled environments with dense multi-view data.
The authors identify that previous Gaussian-based 4D scene representations struggle when applied to monocular settings due to their underconstrained nature. These settings lack the multi-view geometry typically available in controlled capture scenarios. To tackle this limitation, the authors propose DGMarbles, a technique that adapts Gaussian splatting, a method known for its efficiency and photometric quality, to dynamic and casual video contexts.
Core Components of DGMarbles
- Isotropic Gaussian Marbles: DGMarbles uses isotropic rather than anisotropic Gaussians, reducing the degrees of freedom of the representation. This constrains the optimization to spend its capacity on scene motion and appearance rather than local shape detail, which helps in the underconstrained monocular setting (see the parameter-layout sketch after this list).
- Hierarchical Divide-and-Conquer Strategy: the video is split into short subsequences, each optimized independently, and the resulting Gaussian sets are merged pairwise up a hierarchy. Optimization alternates between local motion estimation and global alignment, refining the Gaussian trajectories into a coherent global motion representation (a sketch of the merge schedule follows this list).
- Tracking and Geometry Priors: to further constrain the optimization, DGMarbles incorporates image-level and geometry-level priors, including a tracking loss built on recent point-tracking methods such as CoTracker, plus geometry priors enforcing local and global isometry that encourage more realistic Gaussian trajectories (both losses are sketched after this list).
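To make the first point concrete, here is a minimal sketch of how an isotropic marble's parameters might be laid out, assuming a NumPy-style tensor convention; the names, sizes, and dictionary layout are illustrative, not the authors' code.

```python
import numpy as np

N, T = 1000, 24  # number of marbles, number of frames (illustrative sizes)

# An anisotropic 3D Gaussian carries a full shape per point:
anisotropic_shape = {
    "scales": np.ones((N, 3)),                        # per-axis scale: 3 DoF
    "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (N, 1)),  # identity quaternion: 3 DoF
}

# An isotropic marble collapses shape to a single radius, so optimization
# capacity goes to motion (a position per frame) and appearance instead:
marbles = {
    "positions": np.zeros((N, T, 3)),  # one trajectory per marble
    "radius": np.ones((N, 1)),         # single isotropic scale
    "color": np.full((N, 3), 0.5),     # RGB appearance
    "opacity": np.full((N, 1), 0.9),
}

# The isotropic covariance is just radius^2 * I; there is no rotation to fit.
cov = (marbles["radius"][:, :, None] ** 2) * np.eye(3)[None]  # (N, 3, 3)
```

Dropping the per-axis scales and rotation removes six shape degrees of freedom per Gaussian in favor of one, leaving the per-frame trajectory as the main unknown.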
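The divide-and-conquer schedule can be illustrated with a runnable toy version in which segments are stand-in lists of frame indices; in the actual method each segment carries an optimized Gaussian set, and every merge triggers re-optimization for global alignment. The helper names here are hypothetical.

```python
def fit_subsequence(frame_ids):
    # Stand-in for per-subsequence optimization of a Gaussian set.
    return list(frame_ids)

def align_and_merge(a, b):
    # Stand-in for merging two Gaussian sets and re-optimizing so their
    # trajectories agree across the shared boundary (global alignment).
    return a + b

def hierarchical_fit(num_frames, init_len=4):
    # Level 0: fit each short subsequence independently (local motion).
    segments = [fit_subsequence(range(i, min(i + init_len, num_frames)))
                for i in range(0, num_frames, init_len)]
    # Higher levels: merge adjacent segments pairwise until one remains.
    while len(segments) > 1:
        merged = [align_and_merge(a, b)
                  for a, b in zip(segments[0::2], segments[1::2])]
        if len(segments) % 2:  # carry an unpaired tail segment upward
            merged.append(segments[-1])
        segments = merged
    return segments[0]

print(hierarchical_fit(10))  # [0, 1, 2, ..., 9] after two merge levels
```

The pairwise schedule means each merge only has to reconcile two locally consistent motion estimates, rather than fitting the whole video at once.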
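The two priors can also be sketched. Here the projection of Gaussian centers to 2D, the source of the tracks (e.g., CoTracker outputs), and the neighbor graph are all placeholders, and the paper's exact formulation and weighting may differ.

```python
import numpy as np

def tracking_loss(proj_2d, tracks_2d, visible):
    # L1 distance between projected Gaussian centers and 2D point tracks,
    # averaged over the points the tracker marks as visible.
    err = np.abs(proj_2d - tracks_2d).sum(axis=-1)  # (N,)
    return (err * visible).sum() / max(visible.sum(), 1)

def isometry_loss(pos_t, pos_ref, nbr_idx):
    # Penalize change in distance to each point's nearest neighbors between
    # a reference frame and frame t, encouraging locally rigid trajectories.
    d_ref = np.linalg.norm(pos_ref[:, None] - pos_ref[nbr_idx], axis=-1)  # (N, K)
    d_t = np.linalg.norm(pos_t[:, None] - pos_t[nbr_idx], axis=-1)
    return np.abs(d_t - d_ref).mean()

rng = np.random.default_rng(0)
pos_ref = rng.normal(size=(100, 3))
nbrs = rng.integers(0, 100, size=(100, 8))          # hypothetical kNN graph
pos_t = pos_ref + 0.01 * rng.normal(size=(100, 3))  # slightly deformed frame
print(isometry_loss(pos_t, pos_ref, nbrs))          # small for near-rigid motion
```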
Evaluation and Performance
DGMarbles is evaluated on the Nvidia Dynamic Scenes and DyCheck iPhone datasets. The results show that DGMarbles clearly outperforms existing Gaussian-based methods in perceptual quality and, without relying on multi-view constraints, matches state-of-the-art non-Gaussian representations. The evaluation also reports strong tracking accuracy and robust novel-view synthesis in challenging monocular scenarios.
Implications and Future Directions
DGMarbles marks a significant stride towards effective 3D reconstruction and rendering from monocular video. Its efficiency, compositionality, and tracking capabilities position it as a strong candidate for applications in video editing, virtual reality, and dynamic 3D content creation. The reduced reliance on multi-view data implies broader usability in scenarios where only single-view captures are possible.
Future work could refine the tracking and regularization techniques to improve robustness in scenes with complex dynamics and few constraints. Tighter integration with learning-based depth estimation and richer image priors could extend DGMarbles to even more diverse settings. Overall, the work is a step toward realistic dynamic 3D scene reconstruction in less controlled environments, with promising implications for computer vision and graphics.