- The paper introduces a dynamic view synthesis method that leverages Gaussian splatting to render high-quality novel views from casually captured monocular videos.
- It employs a novel 3D-aware initialization scheme that lifts estimated depth maps and 2D flow into 3D flow, giving the deformation field an accurate starting point for scene reconstruction.
- Experimental results show that MoDGS significantly outperforms previous methods, achieving better PSNR, SSIM, and LPIPS scores on multiple datasets.
MoDGS: Dynamic Gaussian Splatting from Casually-Captured Monocular Videos
The paper "MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos" introduces a new methodology for Dynamic View Synthesis (DVS) based on Gaussian splatting techniques. The goal of MoDGS is to render high-quality novel-view images from casually captured monocular videos of dynamic scenes, addressing the limitations of existing methods that rely on synchronized cameras or "teleporting" camera motions.
Introduction and Background
Novel view synthesis (NVS) is crucial for applications in computer graphics and computer vision, such as virtual and augmented reality. While techniques like NeRF, Instant-NGP, and Gaussian Splatting have greatly improved NVS for static scenes, synthesizing novel views of dynamic scenes from a single monocular video remains challenging. Traditional dynamic view synthesis methods require multiview videos or large camera movements to maintain multiview consistency, which casually captured videos, with their smooth or nearly static camera trajectories, cannot provide.
Methodology
MoDGS introduces a pipeline that addresses the weak multiview constraints inherent in casually captured monocular videos. It builds on a single-view (monocular) depth estimator and makes several key contributions:
- 3D-Aware Initialization:
- To overcome initialization issues in weakly constrained settings, MoDGS introduces a 3D-aware initialization scheme for the deformation field. The scheme uses estimated depth maps and 2D flow predictions between frames, deriving a coarse depth scale for each frame and normalizing the depth maps so they stay consistent across time.
- Deformation Field Initialization:
- The initialization lifts the 2D flow and depth maps into a 3D flow, which supervises the initial training of the deformation field. This gives the deformation field an accurate starting point and improves subsequent scene reconstruction (a minimal sketch of the lifting step appears after this list).
- Gaussian Initialization:
- Depth maps from different frames are back-projected and warped into a canonical space, providing a robust initial point set for creating the Gaussians. This ensures the Gaussians are well distributed and accurately represent the 3D structure of the scene (see the canonical-point sketch below).
- Ordinal Depth Loss:
- The authors propose an ordinal depth loss that exploits the order consistency of depth values across frames, compensating for inaccuracies in the estimated depth maps that a Pearson-correlation loss cannot rectify. Keeping depth orderings stable over time improves the realism of the rendered views (a hedged sketch follows this list).
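To make the lifting step concrete, here is a minimal sketch of how 2D flow and depth can be turned into a 3D flow field. It is an illustration written from the description above, not the authors' code: it assumes a pinhole intrinsics matrix `K`, depth maps that have already been scale-aligned, and a dense 2D flow field, and it uses a nearest-pixel lookup in place of whatever resampling MoDGS actually performs.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) into camera-space 3D points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # K^-1 [u, v, 1]^T per pixel
    return rays * depth[..., None]           # scale each ray by its depth

def lift_flow_to_3d(depth_t, depth_t1, flow_2d, K):
    """Approximate the 3D scene flow from frame t to frame t+1.

    Each pixel of frame t is back-projected with its (scale-aligned) depth,
    its 2D-flow correspondence in frame t+1 is back-projected likewise,
    and the difference of the two 3D points is taken as the 3D flow.
    """
    H, W = depth_t.shape
    pts_t = backproject(depth_t, K)
    pts_t1_full = backproject(depth_t1, K)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Follow the 2D flow and snap to the nearest pixel in frame t+1.
    u1 = np.clip(np.round(u + flow_2d[..., 0]), 0, W - 1).astype(int)
    v1 = np.clip(np.round(v + flow_2d[..., 1]), 0, H - 1).astype(int)
    pts_t1 = pts_t1_full[v1, u1]
    return pts_t1 - pts_t                    # per-pixel 3D flow, shape (H, W, 3)
```

The resulting 3D flow can then serve as direct supervision for the deformation field's initial training, as the paper describes.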
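Similarly, the Gaussian initialization can be pictured as back-projecting every frame's depth map and moving the resulting points into a shared canonical space. The sketch below reuses the `backproject` helper above and assumes hypothetical per-frame 3D displacements `flows_to_canonical` toward the canonical frame (for example, chained from the pairwise 3D flows); it is not the paper's implementation.

```python
import numpy as np

def init_canonical_points(depths, flows_to_canonical, K, stride=8):
    """Hedged sketch: assemble an initial canonical-space point cloud.

    `depths` is a list of per-frame, scale-aligned depth maps and
    `flows_to_canonical[i]` is an assumed per-pixel 3D displacement that
    carries frame i's points into the canonical frame. `stride`
    subsamples pixels to keep the point count manageable.
    """
    points = []
    for depth, flow3d in zip(depths, flows_to_canonical):
        pts = backproject(depth, K)[::stride, ::stride].reshape(-1, 3)
        off = flow3d[::stride, ::stride].reshape(-1, 3)
        points.append(pts + off)             # move frame-i points into the canonical frame
    return np.concatenate(points, axis=0)    # (N, 3) seed positions for the Gaussians
```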
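Finally, the ordinal depth loss can be approximated with a pairwise hinge penalty on depth orderings. This is one plausible formulation inferred from the description, not the exact loss used in the paper; the pair count and margin are made-up hyperparameters.

```python
import torch

def ordinal_depth_loss(rendered_depth, reference_depth, num_pairs=4096, margin=1e-4):
    """Hedged sketch of an order-consistency depth loss.

    Random pixel pairs are sampled; whenever the reference (estimated)
    depth says one pixel is in front of the other, the rendered depth is
    penalized for flipping that order. Only the ordering is supervised,
    so the unknown scale of the estimated depth does not matter.
    Both inputs are assumed to be depth tensors of the same shape.
    """
    flat_render = rendered_depth.reshape(-1)
    flat_ref = reference_depth.reshape(-1)
    n = flat_ref.numel()
    i = torch.randint(0, n, (num_pairs,), device=flat_ref.device)
    j = torch.randint(0, n, (num_pairs,), device=flat_ref.device)
    # +1 if pixel i is farther than pixel j in the reference depth, -1 otherwise.
    order = torch.sign(flat_ref[i] - flat_ref[j])
    # Hinge on the rendered depth difference: zero loss when the order agrees.
    return torch.relu(margin - order * (flat_render[i] - flat_render[j])).mean()
```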
Experimental Results
MoDGS demonstrates significant performance improvements on several datasets, including the DyNeRF and Nvidia datasets, achieving better PSNR, SSIM, and LPIPS scores than baseline methods such as Deformable-GS, SC-GS, and HexPlane. Qualitative comparisons likewise show that MoDGS reconstructs dynamic scenes robustly, producing detailed novel views with few artifacts.
Quantitative Results
| Dataset | Metric       | MoDGS  | Deformable-GS |
|---------|--------------|--------|---------------|
| Nvidia  | Avg. PSNR ↑  | 23.27  | 21.47         |
| Nvidia  | Avg. SSIM ↑  | 0.6908 | 0.5948        |
| Nvidia  | Avg. LPIPS ↓ | 0.1826 | 0.2661        |
| DyNeRF  | Avg. PSNR ↑  | 22.90  | 19.76         |
| DyNeRF  | Avg. SSIM ↑  | 0.8043 | 0.7528        |
| DyNeRF  | Avg. LPIPS ↓ | 0.1575 | 0.2172        |
Implications and Future Work
The introduction of MoDGS opens up new avenues for practical and high-quality dynamic view synthesis from casually captured videos. The implications are notable for consumer applications, such as user-generated content on social media and personal video archives. The methodology demonstrates that high-quality novel views can be synthesized without requiring complex, synchronized, and large-movement camera setups.
Future work could further optimize monocular depth estimation for even more accurate reconstruction and could investigate integrating generative models, such as 3D-related diffusion models, to handle unseen parts of the scene. Improving computational efficiency for faster training and real-time rendering also remains a promising direction.
Conclusion
MoDGS represents a substantial advancement in dynamic view synthesis from casually captured monocular videos, solving critical challenges through 3D-aware initialization and ordinal depth loss. Its performance across various datasets and settings underscores its potential for widespread application and further research in dynamic scene rendering.