- The paper introduces MINE, which fuses MPI with NeRF to incorporate continuous depth for novel view synthesis from single images.
- The method employs differentiable rendering and a ResNet-50 encoder, yielding clear gains in SSIM and LPIPS across multiple datasets.
- Experimental results on KITTI, RealEstate10K, and Flowers Light Fields demonstrate its robust scene reconstruction capabilities even without direct depth supervision.
MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis
The paper presents MINE, a method that integrates Multiplane Images (MPI) with Neural Radiance Fields (NeRF) to perform novel view synthesis and depth estimation from a single image. It extends prior MPI work with a continuous depth representation and fully differentiable rendering.
Methodology Overview
MINE generalizes the traditional MPI to continuous depth using NeRF principles. Starting from a single RGB image, the network predicts, for any queried depth plane, a 4-channel image containing RGB and volume density; these plane-wise predictions reconstruct the source camera frustum, which is then rendered into novel views. Because the rendering is differentiable end to end, the model learns view synthesis and depth estimation without annotated depth supervision.
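The rendering step above can be sketched as NeRF-style alpha compositing over fronto-parallel planes. The following is a minimal NumPy illustration, not the paper's implementation; function and variable names are our own, and it assumes per-plane RGB and density have already been predicted.

```python
import numpy as np

def composite_planes(rgb, sigma, depths):
    """Alpha-composite N fronto-parallel planes into one image.

    rgb:    (N, H, W, 3) per-plane colour, nearest plane first
    sigma:  (N, H, W)    per-plane volume density
    depths: (N,)         plane depths, ascending
    Returns an (H, W, 3) image using the NeRF quadrature
    alpha_i = 1 - exp(-sigma_i * delta_i), composited front to back.
    """
    # inter-plane spacing; repeat the last gap for the final plane
    deltas = np.diff(depths, append=2 * depths[-1] - depths[-2])
    alpha = 1.0 - np.exp(-sigma * deltas[:, None, None])       # (N, H, W)
    # transmittance: product of (1 - alpha) over all planes in front
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=0)
    trans = np.concatenate([np.ones_like(trans[:1]), trans[:-1]], axis=0)
    weights = alpha * trans                                    # (N, H, W)
    return (weights[..., None] * rgb).sum(axis=0)
```

A fully opaque near plane (large density) occludes everything behind it, as expected of front-to-back over-compositing.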
Experimental Validation
The paper provides extensive empirical validation on several datasets, including RealEstate10K, KITTI, and Flowers Light Fields. The numerical results show clear gains over state-of-the-art methods in novel view synthesis:
- KITTI Dataset: MINE achieved an SSIM of 0.822, versus 0.733 for the discrete MPI baseline it builds upon.
- RealEstate10K Dataset: Demonstrated improvements in LPIPS and SSIM across various configurations, notably outperforming contemporary methods like SynSin.
- Flowers Light Fields: Further validated MINE's ability to handle complex light-field scenes, surpassing competing methods on this benchmark.
The approach also yielded competitive results in depth estimation on datasets like iBims-1 and NYU-v2, despite the lack of direct depth annotations in the training data.
Technical Contributions
The paper introduces several technical contributions that bolster the performance of MINE:
- Continuous Depth Representation: By generalizing MPI to a continuous rather than discrete depth framework using NeRF principles, MINE allows more flexible and precise depth specification.
- Efficient Rendering: The approach efficiently synthesizes views using a fully convolutional network, requiring only a fixed number of inferences independent of the output resolution.
- Robust Network Structure: A ResNet-50 encoder, pre-trained on ImageNet, provides strong feature extraction, while the decoder is tailored to jointly predict RGB and volume density.
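The continuous depth representation in the first bullet can be illustrated by stratified sampling of plane positions, uniform in disparity: each training iteration perturbs the planes within their strata, so the network is queried over a continuum of depths rather than a fixed discrete set. This is a hedged sketch of the idea, not the paper's code; the function name and API are ours.

```python
import numpy as np

def sample_disparity_planes(n_planes, d_min, d_max, rng=None):
    """Stratified sampling of plane depths, uniform in disparity (1/depth).

    Partitions the disparity range [1/d_max, 1/d_min] into n_planes strata
    and draws one random disparity per stratum, so repeated calls cover a
    continuum of depths. Returns depths in ascending order, shape (n_planes,).
    """
    rng = np.random.default_rng() if rng is None else rng
    # strata edges in disparity, near (large disparity) to far (small)
    edges = np.linspace(1.0 / d_min, 1.0 / d_max, n_planes + 1)
    u = rng.uniform(size=n_planes)                 # one draw per stratum
    disparity = edges[:-1] + u * (edges[1:] - edges[:-1])
    return np.sort(1.0 / disparity)
```

Sampling uniformly in disparity rather than depth allocates more planes near the camera, where parallax (and thus rendering error) is largest.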
Implications and Future Developments
MINE is a significant step toward realistic scene reconstruction from single-image input. The implications are substantial for fields that rely on automated 3D scene reproduction, such as immersive AR/VR applications and robotics.
The continuous representation opens avenues for further research into rendering efficiency and output fidelity. Future work could incorporate directional cues to capture view-dependent effects, explore alternative encoder backbones, or scale training to larger, more diverse datasets.
MINE sets a foundation for future advancements, particularly in scenarios requiring rapid adaptation to unseen environments with minimal input data.