- The paper proposes a two-stage self-supervised training approach for novel view synthesis and camera pose estimation from uncalibrated video alone.
- Experimental results show superior novel view synthesis quality and pose accuracy compared to methods requiring calibrated cameras or geometric priors.
- This method significantly reduces reliance on calibrated datasets, enabling large-scale training of 3D vision networks from diverse uncalibrated video sources for applications like VR and autonomous navigation.
Overview of Novel View Synthesis via Learning from Uncalibrated Videos
The paper "Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos" proposes a novel two-stage training approach to address the critical task of novel view synthesis from uncalibrated video data. This paper challenges common prerequisites found in state-of-the-art methods, such as calibrated cameras or geometric priors, which limit their applicability to vast datasets of uncalibrated videos. The authors introduce a self-supervised approach that utilizes raw video frames alone to train models that synthesize novel views while estimating camera poses accurately.
Methodology
The authors present a two-stage process:
- Implicit Reconstruction Pretraining: In the first stage, the model performs an implicit scene reconstruction from video frames without any explicit 3D representation. It predicts per-frame latent camera features and feeds them to a view synthesis model similar to LVSM, pretraining the network end-to-end so that it learns 3D consistency in a self-supervised manner. Letting the network render scenes implicitly sidesteps the optimization biases that explicit 3D representations can introduce (a minimal sketch of this stage follows the list).
- Explicit Reconstruction Alignment: Because the latent representation need not match the real 3D scene, the second stage adds predictions of explicit 3D Gaussian primitives. A Gaussian Splatting rendering loss and a depth projection loss align the latent representation with concrete 3D geometry, and this explicit 3D consistency complements the pretraining stage and refines its results (a second sketch after the list illustrates one plausible form of this objective).
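To make the first stage concrete, the following is a minimal, hedged sketch of an implicit-reconstruction pretraining step: a per-frame latent pose encoder plus a renderer stand-in for the paper's LVSM-style model, supervised only by regressing a held-out frame. The module architectures, dimensions, and triplet layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of stage-1 implicit reconstruction pretraining (assumed, simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPoseEncoder(nn.Module):
    """Predicts a latent camera feature per frame (no explicit extrinsics)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, frame):                      # (B, 3, H, W) -> (B, dim)
        return self.net(frame)

class LatentRenderer(nn.Module):
    """Toy stand-in for an LVSM-like model: context frames + latent poses -> target frame."""
    def __init__(self, dim=64):
        super().__init__()
        # 2 context frames (3 channels each) + 3 broadcast latent pose features
        self.net = nn.Sequential(
            nn.Conv2d(6 + 3 * dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, ctx, poses):                 # ctx: (B, 6, H, W), poses: (B, 3*dim)
        B, _, H, W = ctx.shape
        pose_maps = poses.view(B, -1, 1, 1).expand(B, poses.shape[1], H, W)
        return self.net(torch.cat([ctx, pose_maps], dim=1))

def stage1_step(frames, pose_enc, renderer, optimizer):
    """frames: (B, 3, 3, H, W) triplet; the middle frame is the held-out target."""
    ctx_a, tgt, ctx_b = frames[:, 0], frames[:, 1], frames[:, 2]
    poses = torch.cat([pose_enc(f) for f in (ctx_a, tgt, ctx_b)], dim=1)
    pred = renderer(torch.cat([ctx_a, ctx_b], dim=1), poses)
    loss = F.mse_loss(pred, tgt)                   # photometric self-supervision only
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

The point of the sketch is that the only supervision is photometric regression of a held-out frame, so no camera calibration enters the training loop.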
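For the second stage, the sketch below shows one plausible form of the alignment objective: a splatting rendering loss, with the rasterizer left as an abstract callable rather than a specific library API, plus a depth-based reprojection term standing in for the depth projection loss. The intrinsics K, relative pose T, and loss weights are assumptions, and the paper's exact formulation may differ.

```python
# Hedged sketch of a stage-2 alignment objective (assumed form, not the paper's equations).
import torch
import torch.nn.functional as F

def unproject(depth, K):
    """depth: (B, 1, H, W), K: (B, 3, 3) -> 3D points in camera coordinates (B, 3, H, W)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=depth.device),
                            torch.arange(W, device=depth.device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()      # (3, H, W)
    pix = pix.unsqueeze(0).expand(B, 3, H, W).reshape(B, 3, -1)
    rays = torch.linalg.solve(K, pix)                                    # K^-1 @ pixels
    return (rays * depth.reshape(B, 1, -1)).reshape(B, 3, H, W)

def depth_projection_loss(depth_tgt, img_tgt, img_src, K, T_tgt_to_src):
    """Warp the source image into the target view through predicted depth and pose,
    then compare photometrically (one plausible depth-projection consistency term)."""
    B, _, H, W = depth_tgt.shape
    pts = unproject(depth_tgt, K).reshape(B, 3, -1)                      # target-camera points
    pts_src = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:4]   # move to source camera
    proj = K @ pts_src
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                      # pixel coordinates
    u = uv[:, 0] / (W - 1) * 2 - 1                                       # normalise for grid_sample
    v = uv[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(img_src, grid, align_corners=True)
    return F.l1_loss(warped, img_tgt)

def stage2_loss(render_fn, gaussians, img_tgt, img_src, depth_tgt, K, T_tgt_to_src,
                w_render=1.0, w_depth=0.1):
    """Total stage-2 objective: splatting render loss + depth projection loss."""
    rendered = render_fn(gaussians)          # any 3D Gaussian splatting rasteriser
    render_loss = F.l1_loss(rendered, img_tgt)
    return w_render * render_loss + w_depth * depth_projection_loss(
        depth_tgt, img_tgt, img_src, K, T_tgt_to_src)
```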
The two stages are closely coupled and mutually beneficial, yielding high-quality novel view synthesis and accurate camera pose estimation without the calibrated-camera supervision that existing methods depend on.
Experimental Results
The authors offer extensive experimental validation on datasets such as RealEstate10K and DL3DV-10K. Their approach achieves superior novel view synthesis quality and pose estimation accuracy, even against techniques that use additional supervision such as camera parameters and depth. Notably, the method trains solely on uncalibrated video, with its interpolated-frame prediction scheme contributing to the reported accuracy (a hedged illustration of such a scheme follows).
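As a rough illustration of what interpolated-frame prediction can mean in practice, the snippet below samples a supervision target strictly between its two context frames, so the model interpolates rather than extrapolates views. The specific sampling rule, gap range, and function name are assumptions for illustration, not the paper's recipe.

```python
# Assumed example of interpolated-frame sampling for training triplets.
import random

def sample_interpolated_triplet(num_frames, max_gap=8):
    """Return (context_a, target, context_b) frame indices with the target inside the span."""
    gap = random.randint(2, max_gap)                 # temporal span between the two contexts
    start = random.randint(0, num_frames - gap - 1)  # first context frame
    end = start + gap                                # second context frame
    target = random.randint(start + 1, end - 1)      # target lies strictly between them
    return start, target, end
```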
Implications
The practical implications are significant. By reducing dependence on calibrated datasets, this work opens the door to training 3D vision networks on large-scale, diverse video collections that were previously impractical to use. In this self-supervised paradigm, almost any video can serve as training data, extending the reach of novel view synthesis to applications ranging from virtual reality to autonomous navigation.
Conclusion and Future Directions
In conclusion, the paper provides a compelling framework for novel view synthesis that needs neither external geometric priors nor calibration data. Future work could extend the methodology to dynamic scenes, removing another barrier to learning from unconstrained, real-world video. Approaches rooted in self-supervision and implicit learning, like the one presented here, may prove robust and adaptable across domains.