pixelNeRF: Neural Radiance Fields from One or Few Images
The paper "pixelNeRF: Neural Radiance Fields from One or Few Images" introduces an innovative approach to generating Neural Radiance Fields (NeRF) conditioned on single or sparse input views. The authors, Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa from UC Berkeley, propose a learning framework that significantly advances the current methodology by conditioning NeRF on input images in a fully convolutional manner.
Abstract and Methodology
The abstract succinctly outlines the core contribution of the paper: the introduction of pixelNeRF, a novel architecture enabling the prediction of NeRFs from one or few posed images. Unlike traditional methods requiring extensive calibrated views and substantial computational resources, pixelNeRF leverages spatial image features aligned to each pixel for conditioning, facilitating training across multiple scenes to infer a scene prior.
Traditionally, NeRF renders photorealistic 3D views but mandates per-scene optimization, a laborious process that often requires numerous input views. By integrating convolutional image features into the NeRF framework, pixelNeRF addresses these limitations, enabling efficient and generalized view synthesis directly from sparse image inputs. The architecture achieves this by computing a feature grid from the input image, sampling the corresponding image feature for each query point via projection and bilinear interpolation, and then combining these features with spatial coordinates and viewing directions to predict density and color at that point.
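The projection-and-interpolation step described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the simple pinhole camera model, and the toy feature map are assumptions made for the example.

```python
import numpy as np

def project_points(points_cam, focal, center):
    """Pinhole projection of 3D points (camera frame, z > 0) to pixel coords."""
    x = points_cam[:, 0] / points_cam[:, 2]
    y = points_cam[:, 1] / points_cam[:, 2]
    u = focal * x + center[0]
    v = focal * y + center[1]
    return np.stack([u, v], axis=-1)  # (N, 2) pixel coordinates

def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at continuous pixel coords."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1 - 1e-6)
    v = np.clip(uv[:, 1], 0, H - 1 - 1e-6)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = u0 + 1, v0 + 1
    wu, wv = u - u0, v - v0
    top = (1 - wu)[:, None] * feat[v0, u0] + wu[:, None] * feat[v0, u1]
    bot = (1 - wu)[:, None] * feat[v1, u0] + wu[:, None] * feat[v1, u1]
    return (1 - wv)[:, None] * top + wv[:, None] * bot  # (N, C) features
```

The sampled per-point features would then be concatenated with the positionally encoded coordinates and viewing direction before being passed to the NeRF MLP.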
Key Contributions and Technical Innovations
Key contributions of pixelNeRF include:
- Image-conditioned NeRF: By conditioning NeRF on pixel-aligned image features, the proposed framework allows for effective scene representation from sparse views.
- Flexibility in View Synthesis: The architecture supports variability in the number of input views, accommodating single to multiple view inputs without necessitating test-time optimization.
- Fully Convolutional Encoder: Utilizing a ResNet34-based encoder, pixelNeRF maintains spatial alignment between image features and 3D space, critical for detailed image reconstruction.
- Hierarchical Volume Sampling: The model integrates hierarchical sampling mechanisms to enhance rendering efficiency, contributing to finer detail reconstruction.
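The hierarchical sampling mentioned above follows the standard NeRF coarse-to-fine scheme: a coarse pass produces compositing weights along each ray, and a fine pass draws additional depth samples from those weights by inverse-transform sampling. A minimal NumPy sketch of that resampling step, assuming precomputed bin edges and coarse weights:

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, rng):
    """Importance-sample depths along a ray from coarse-pass weights
    via inverse-transform sampling (the hierarchical 'fine' pass)."""
    w = weights + 1e-5                      # avoid zero-probability bins
    pdf = w / w.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(0.0, 1.0, n_samples)    # uniform draws to invert the CDF
    idx = np.searchsorted(cdf, u, side="right") - 1
    idx = np.clip(idx, 0, len(weights) - 1)
    # Linearly interpolate within the selected bin.
    denom = np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    t = (u - cdf[idx]) / denom
    return bin_edges[idx] + t * (bin_edges[idx + 1] - bin_edges[idx])
```

This concentrates fine samples where the coarse pass found density, which is what yields the finer detail reconstruction at a fixed sample budget.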
Evaluation and Experimental Results
The authors conduct extensive experiments across various datasets, demonstrating pixelNeRF's performance in multiple configurations:
- ShapeNet Benchmarks: Both category-specific and category-agnostic models were evaluated for view synthesis, revealing significant improvements over state-of-the-art methods like SRN and DVR in terms of PSNR, SSIM, and LPIPS metrics.
- Unseen Categories and Multi-object Scenes: The framework's ability to generalize to novel categories and handle multi-object scenes was evaluated, showcasing meaningful reconstruction even when applied to unseen object instances.
- Real-world Scenarios: Application to the DTU dataset illustrates pixelNeRF’s capability in handling real scenes, achieving plausible novel views from sparse real images, a prominent challenge in NeRF methodologies.
Quantitative results highlight the robustness of the proposed approach:
- pixelNeRF consistently achieves higher PSNR (26.80) and SSIM (0.910), outperforming SRN (23.28 PSNR, 0.849 SSIM) and DVR across all categories in the category-agnostic setting.
- For real-world datasets like DTU, the model efficiently generates novel views from input sets of as few as three images, outperforming traditional per-scene optimized NeRFs.
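The PSNR figures quoted above follow the standard definition, computable from rendered and ground-truth images. A generic sketch (not the authors' evaluation code), assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR means lower pixel-wise reconstruction error; SSIM and LPIPS complement it by measuring structural and perceptual similarity, respectively.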
Implications and Future Directions
The contributions of pixelNeRF span both theoretical and practical realms in neural rendering:
- Theoretical Implications: The introduction of pixel-aligned feature conditioning in NeRF architecture marks a significant theoretical advancement in neural scene representation, contributing to generalized learning across diverse scene categories.
- Practical Applications: Practically, pixelNeRF’s ability to efficiently synthesize novel views from limited inputs has far-reaching implications for real-time applications in augmented reality, 3D virtual environments, and robotic vision.
Future directions may include:
- Efficiency Improvements: Further research could focus on enhancing the runtime efficiency of the model, addressing the linear increase in computational demand with the number of input views.
- Scale Invariance: Developing strategies that let NeRF models adjust automatically to varying scene scales and choose appropriate positional encodings.
- Broader Dataset Applications: Extending pixelNeRF to handle unstructured, in-the-wild datasets for more generalized real-world applications remains an exciting frontier.
In conclusion, pixelNeRF significantly pushes the envelope in neural radiance field generation, charting a path for future innovations in efficient, generalizable, and practical 3D scene reconstruction methodologies.