
SinNeRF: Training Neural Radiance Fields on Complex Scenes from a Single Image (2204.00928v2)

Published 2 Apr 2022 in cs.CV

Abstract: Despite the rapid development of Neural Radiance Field (NeRF), the necessity of dense covers largely prohibits its wider applications. While several recent works have attempted to address this issue, they either operate with sparse views (yet still, a few of them) or on simple objects/scenes. In this work, we consider a more ambitious task: training neural radiance field, over realistically complex visual scenes, by "looking only once", i.e., using only a single view. To attain this goal, we present a Single View NeRF (SinNeRF) framework consisting of thoughtfully designed semantic and geometry regularizations. Specifically, SinNeRF constructs a semi-supervised learning process, where we introduce and propagate geometry pseudo labels and semantic pseudo labels to guide the progressive training process. Extensive experiments are conducted on complex scene benchmarks, including NeRF synthetic dataset, Local Light Field Fusion dataset, and DTU dataset. We show that even without pre-training on multi-view datasets, SinNeRF can yield photo-realistic novel-view synthesis results. Under the single image setting, SinNeRF significantly outperforms the current state-of-the-art NeRF baselines in all cases. Project page: https://vita-group.github.io/SinNeRF/

Citations (167)

Summary

  • The paper proposes a single-view strategy that uses semantic and geometry pseudo labels for effective depth and texture reconstruction.
  • The framework leverages progressive strided ray and Gaussian pose sampling along with warping-based depth supervision to stabilize training and improve novel view synthesis.
  • The method outperforms state-of-the-art models in PSNR, SSIM, and LPIPS across multiple benchmarks, broadening NeRF's applicability to real-world scenarios.

SinNeRF: Training Neural Radiance Fields from a Single Image in Complex Scenes

The paper introduces SinNeRF, a methodology for training Neural Radiance Fields (NeRFs) for novel view synthesis in complex scenes from only a single image. NeRFs have become a prominent scene representation in computer vision, particularly for synthesizing photorealistic images from arbitrary viewpoints. Traditional NeRF approaches, however, require many views with precise camera poses, which restricts their applicability in real-world scenarios where capturing dense views is impractical.

Key Contributions

  1. Single-View Approach: Whereas existing sparse-view methods still require at least a few input views, SinNeRF tightens this constraint to a single view. This is critical for applications where acquiring additional views is impractical.
  2. Framework Design: The framework adopts a semi-supervised learning approach incorporating pseudo labels based on semantic and geometry regularization. Geometry pseudo labels are generated through image warping techniques that propagate depth information, ensuring consistency across multiple views. Semantic pseudo labels are formed using local texture guidance and global structure priors, enabled through adversarial learning and Vision Transformer (ViT) embeddings.
  3. Performance Evaluation: SinNeRF demonstrates superior performance against state-of-the-art methods such as DS-NeRF, DietNeRF, and PixelNeRF across various benchmarks, including the NeRF synthetic, Local Light Field Fusion (LLFF), and DTU datasets. The quantitative metrics — PSNR, SSIM, and LPIPS — validate its effectiveness in producing photorealistic novel views even without pre-training on multi-view datasets.
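The depth propagation in point 2 amounts to backprojecting reference-view pixels to 3D using the known depth map and reprojecting them into a novel camera. The NumPy sketch below illustrates this warping under simplifying assumptions (a pinhole camera, no occlusion masking); the function name and interface are hypothetical, not the authors' implementation:

```python
import numpy as np

def warp_depth_to_novel_view(depth_ref, K, R, t):
    """Forward-warp a reference-view depth map into a novel view.

    Backprojects each reference pixel to 3D with intrinsics K and the
    depth map, applies the relative pose (R, t), and reprojects into
    the novel view. The result serves as a geometry pseudo label;
    pixels no point lands on are left at 0.
    """
    h, w = depth_ref.shape
    K_inv = np.linalg.inv(K)

    # Pixel grid in homogeneous coordinates, shape (3, h*w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])

    # Backproject to reference-camera 3D points, then move to the novel camera.
    pts_ref = (K_inv @ pix) * depth_ref.ravel()        # (3, h*w)
    pts_novel = R @ pts_ref + t[:, None]               # (3, h*w)

    # Project into the novel image plane.
    proj = K @ pts_novel
    z = proj[2]
    valid = z > 1e-6
    u_n = np.round(proj[0, valid] / z[valid]).astype(int)
    v_n = np.round(proj[1, valid] / z[valid]).astype(int)
    z = z[valid]

    # Keep points inside the image; nearest point wins on collisions
    # (far points are written first, nearer ones overwrite them).
    inside = (u_n >= 0) & (u_n < w) & (v_n >= 0) & (v_n < h)
    depth_novel = np.zeros((h, w))
    order = np.argsort(-z[inside])
    for uu, vv, zz in zip(u_n[inside][order], v_n[inside][order], z[inside][order]):
        depth_novel[vv, uu] = zz
    return depth_novel
```

A real pipeline would additionally mask occluded and out-of-frustum regions before using the warped depth as supervision.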
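For reference, the headline metric in point 3 is straightforward to compute; a minimal PSNR implementation (the standard definition, not code from the paper) is:

```python
import numpy as np

def psnr(img_pred, img_true, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a rendered view and
    ground truth; higher is better. max_val is the dynamic range of
    the images (1.0 for [0, 1] floats, 255 for 8-bit)."""
    mse = np.mean((np.asarray(img_pred) - np.asarray(img_true)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

SSIM and LPIPS are complementary: SSIM measures local structural similarity, while LPIPS compares deep-network features and correlates better with perceived quality.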

Technical Insights

  • Geometry Pseudo Labels: The use of depth information from the reference view to project onto novel views via warping is pivotal in maintaining geometric consistency. This ensures accurate 3D reconstruction from a single image input by employing depth map supervision and enforcing depth smoothness constraints.
  • Semantic Pseudo Labels: The integration of a patch discriminator facilitates more refined texture synthesis, while semantic consistency is enforced through a pre-trained ViT, which comprehends complex global structures despite pixel-level misalignment across views.
  • Progressive Training Strategy: The authors implement progressive strided ray sampling and Gaussian pose sampling, which stabilize training and ensure robust synthesis from previously unseen poses, effectively mitigating overfitting.
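At its core, the ViT-based semantic consistency term compares global embeddings of the reference image and a rendered novel view. A minimal sketch, assuming precomputed feature vectors stand in for real ViT [CLS] embeddings (the function name is illustrative):

```python
import numpy as np

def semantic_consistency_loss(feat_ref, feat_novel, eps=1e-8):
    """Cosine distance between two global image embeddings: 0 when they
    point the same way, approaching 2 when they point opposite ways.
    Operating on global embeddings makes the loss tolerant of the
    pixel-level misalignment between different viewpoints."""
    a = feat_ref / (np.linalg.norm(feat_ref) + eps)
    b = feat_novel / (np.linalg.norm(feat_novel) + eps)
    return 1.0 - float(a @ b)
```

In practice the embeddings would come from a frozen pre-trained ViT applied to the reference image and the rendered patch.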
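The two sampling ideas in the progressive training strategy can be sketched as follows. This is a hedged illustration under assumed interfaces (the function names and the yaw/pitch pose parameterization are my assumptions, not the paper's): strided sampling lets a small ray batch cover a wide image patch, and Gaussian perturbation of the camera keeps early novel views close to the reference view.

```python
import numpy as np

def strided_ray_grid(h, w, stride, offset=(0, 0)):
    """Sample pixel coordinates on a strided grid. With a large stride a
    small ray batch still spans a wide patch (useful for patch-based
    losses); annealing the stride toward 1 recovers dense detail."""
    ys = np.arange(offset[0], h, stride)
    xs = np.arange(offset[1], w, stride)
    return np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

def sample_novel_pose(base_pose, std, rng):
    """Perturb reference camera parameters (here a yaw/pitch pair) with
    Gaussian noise. Scheduling `std` from small to large explores poses
    progressively farther from the reference view as training proceeds."""
    base_pose = np.asarray(base_pose, dtype=float)
    return base_pose + rng.normal(0.0, std, size=base_pose.shape)
```

Annealing both schedules over training is what makes the strategy "progressive": supervision starts coarse and near the reference view, then gradually sharpens and widens.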

Implications and Future Directions

The presented methodology broadens the horizon for NeRF by enabling training from minimal input data, a significant advancement for scenarios such as augmented reality (AR) and autonomous driving where capturing extensive viewpoints is logistically difficult. This approach also suggests future possibilities for optimizing NeRF models in terms of training efficiency and extending them to even more constrained input settings. Continuing research might explore hybrid models that incorporate sparse view inputs along with the single image approach for further enhancing scene realism and detail preservation.

SinNeRF marks a stride towards achieving efficient view synthesis in computationally challenging environments, ultimately pushing the boundaries of applying NeRF in practical and industrial scenarios. As computational methods advance, the role of semi-supervised learning frameworks and pseudo-labeling strategies is likely to become increasingly prominent, offering frameworks such as SinNeRF a pathway to refined adaptation and application across diverse domains.
