Dynamic View Synthesis from Dynamic Monocular Video (2105.06468v1)

Published 13 May 2021 in cs.CV

Abstract: We present an algorithm for generating novel views at arbitrary viewpoints and any input time step given a monocular video of a dynamic scene. Our work builds upon recent advances in neural implicit representation and uses continuous and differentiable functions for modeling the time-varying structure and the appearance of the scene. We jointly train a time-invariant static NeRF and a time-varying dynamic NeRF, and learn how to blend the results in an unsupervised manner. However, learning this implicit function from a single video is highly ill-posed (with infinitely many solutions that match the input video). To resolve the ambiguity, we introduce regularization losses to encourage a more physically plausible solution. We show extensive quantitative and qualitative results of dynamic view synthesis from casually captured videos.

Citations (361)

Summary

  • The paper presents a novel method for modeling dynamic radiance fields by jointly training static and dynamic NeRFs with scene flow predictions.
  • It leverages regularization and motion matching losses to ensure temporal consistency and mitigate ambiguities from single-view inputs.
  • Experiments on the Dynamic Scene Dataset show significant improvements in PSNR and LPIPS, advancing applications in virtual reality and immersive media.

Dynamic View Synthesis from Dynamic Monocular Video: An Overview

This paper, authored by Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang, presents an algorithm for synthesizing novel views of a dynamic scene from a single monocular video. Leveraging advances in neural implicit representations, the method generates views at arbitrary viewpoints and arbitrary input time steps. It combines a time-invariant static Neural Radiance Field (NeRF) with a time-varying dynamic NeRF and learns to blend their outputs in an unsupervised manner, addressing the inherent challenges of dynamic view synthesis from a single camera.
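To make the blending idea concrete, the display below gives one schematic form of a blended volume-rendering equation, where each ray sample's contribution is a convex combination of the static and dynamic branches weighted by a learned blending weight. The notation (blending weight b, superscripts s/d for static/dynamic) is illustrative and does not reproduce the paper's exact formulation.

```latex
% Schematic blended volume rendering along a ray r at time t (illustrative notation).
% b_i is a learned per-sample blending weight; s/d denote the static and dynamic branches.
\[
\hat{C}(\mathbf{r}, t) \;=\; \sum_{i=1}^{N} T_i
  \Big( b_i \,\alpha_i^{s}\, \mathbf{c}_i^{s}
      \;+\; (1 - b_i)\,\alpha_i^{d}\, \mathbf{c}_i^{d} \Big),
\qquad
T_i \;=\; \prod_{j<i} \Big( 1 - b_j \alpha_j^{s} - (1 - b_j)\alpha_j^{d} \Big)
\]
```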

Core Contributions

The primary contribution of this research is its method for modeling dynamic radiance fields. The approach jointly trains a static NeRF and a dynamic NeRF, with the latter modeling the time-varying parts of the scene. The key innovation is the use of scene flow prediction to provide a multi-view constraint, which is crucial in a setting where each time step is observed from only a single viewpoint. By predicting forward and backward scene flow, the method warps the dynamic radiance field between adjacent time steps, enforcing temporal consistency.
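As a concrete illustration of the scene-flow idea, the PyTorch-style sketch below warps sampled 3D points to neighboring time steps using predicted forward/backward flow and penalizes disagreement between the field evaluated at the original and warped locations. All names and signatures here (dynamic_nerf, flow_fw, flow_bw, etc.) are placeholders for illustration, not the authors' implementation.

```python
import torch

def temporal_consistency_loss(dynamic_nerf, pts, t, dt=1.0):
    """Hypothetical sketch: enforce consistency of a dynamic radiance field
    across adjacent time steps via predicted 3D scene flow.

    dynamic_nerf(pts, t) is assumed to return (rgb, sigma, flow_fw, flow_bw):
    color, density, and forward/backward scene flow at time t.
    """
    rgb_t, sigma_t, flow_fw, flow_bw = dynamic_nerf(pts, t)

    # Warp the sampled 3D points to the neighboring time steps.
    pts_next = pts + flow_fw          # expected positions at time t + dt
    pts_prev = pts + flow_bw          # expected positions at time t - dt

    # Re-query the field at the warped points and neighboring times.
    rgb_next, sigma_next, _, _ = dynamic_nerf(pts_next, t + dt)
    rgb_prev, sigma_prev, _, _ = dynamic_nerf(pts_prev, t - dt)

    # Warped color and density should agree with the current ones; this is
    # what supplies the multi-view-like constraint from a monocular video.
    loss = (rgb_next - rgb_t).abs().mean() + (rgb_prev - rgb_t).abs().mean() \
         + (sigma_next - sigma_t).abs().mean() + (sigma_prev - sigma_t).abs().mean()
    return loss
```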

Additionally, regularization losses are pivotal in resolving the ambiguities inherent to single-view inputs. These include a motion matching loss, which aligns the 2D motion induced by the predicted scene flow with estimated optical flow, alongside further regularization terms that encourage plausible geometry and temporal smoothness.
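The motion matching idea can be sketched as follows: points are advected by the predicted 3D scene flow, projected into the next frame, and the resulting induced 2D flow is compared against a precomputed optical flow (e.g., from an off-the-shelf estimator). The projection helper and tensor shapes below are assumptions for illustration, not the paper's code.

```python
import torch

def motion_matching_loss(pts, flow_fw, optical_flow_2d, K, w2c_t, w2c_t1):
    """Hypothetical sketch of a motion matching loss.

    pts:             (N, 3) 3D sample points at time t (world coordinates)
    flow_fw:         (N, 3) predicted forward scene flow (t -> t+1)
    optical_flow_2d: (N, 2) estimated 2D optical flow at the corresponding pixels
    K:               (3, 3) camera intrinsics
    w2c_t, w2c_t1:   (4, 4) world-to-camera extrinsics at times t and t+1
    """
    def project(p, w2c):
        # Transform to the camera frame and apply a pinhole projection.
        p_h = torch.cat([p, torch.ones_like(p[:, :1])], dim=-1)   # (N, 4)
        p_cam = (w2c @ p_h.T).T[:, :3]                            # (N, 3)
        uv = (K @ p_cam.T).T                                      # (N, 3)
        return uv[:, :2] / uv[:, 2:3]                             # (N, 2)

    uv_t = project(pts, w2c_t)                 # pixel locations at time t
    uv_t1 = project(pts + flow_fw, w2c_t1)     # locations after applying scene flow

    induced_flow_2d = uv_t1 - uv_t             # 2D flow induced by the 3D motion
    return (induced_flow_2d - optical_flow_2d).abs().mean()
```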

Experimental Results

The algorithm was validated on the Dynamic Scene Dataset, showing notable improvements over existing methods, including the approach of Yoon et al. The results show higher PSNR and lower LPIPS, underscoring the efficacy of the proposed multi-view constraints and regularizations for high-quality dynamic view synthesis. The ability to synthesize photorealistic views even in dynamic scenarios highlights the robustness of the approach.
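For reference, PSNR and LPIPS can be computed on rendered-versus-ground-truth frames with a few lines of code; the snippet below is a generic evaluation sketch (using the lpips package), not the paper's exact evaluation protocol.

```python
import torch
import lpips  # pip install lpips

def psnr(pred, gt, max_val=1.0):
    """PSNR between two images with values in [0, max_val], shape (3, H, W)."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# LPIPS expects inputs in [-1, 1] with shape (N, 3, H, W).
lpips_fn = lpips.LPIPS(net='alex')

def lpips_score(pred, gt):
    """Perceptual distance (lower is better) for image batches in [0, 1]."""
    return lpips_fn(pred * 2 - 1, gt * 2 - 1).mean().item()
```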

Implications and Future Developments

Practically, this research has various applications across domains such as virtual reality, immersive media production, and telepresence, where dynamic and interactive viewing experiences are desired. Theoretically, it advances the understanding of neural implicit representations in handling time-varying structures, pushing forward the capabilities of visually coherent scene reconstruction from monocular inputs.

Future developments may extend this work to handle more complex motion dynamics and to scale more efficiently to larger datasets. Further refinement of the regularization terms may reduce the model's dependence on accurate optical flow estimation, potentially improving performance in more challenging conditions, such as those involving non-rigid deformations or significant occlusions.

Overall, this paper represents a key step toward more versatile and accessible dynamic view synthesis, setting the stage for ongoing innovations in modeled visual environments.