- The paper introduces a neural rerendering framework that decomposes scenes into a factored 3D representation so it can handle the dynamic conditions found in unstructured photo collections.
- It employs a two-stage process: SfM/MVS-based point cloud reconstruction followed by a deferred-shading neural network that interpolates and extrapolates scene appearance.
- Experimental results show improved rendering fidelity and appearance versatility, pointing toward applications in virtual tourism, gaming, and augmented reality.
Overview of "Neural Rerendering in the Wild"
The paper "Neural Rerendering in the Wild" by Meshry et al. addresses the challenge of total scene capture. This involves recording and rerendering a scene under different appearance conditions such as weather variations, daylight changes, and including transient objects like pedestrians. The process uses community-contributed photos of famous tourist landmarks, leveraging these unstructured collections to generate highly realistic scene renderings. Through an innovative neural rerendering framework, the authors aim to bridge the gap between the varied captures of a scene found in public image datasets and the consistent realism of photographs.
Methodological Approach
The paper introduces a two-component approach for total scene capture:
- Factored Representation: The model begins with a 3D reconstruction of the scene using Structure-from-Motion (SfM) and Multi-View Stereo (MVS), which approximates the scene as a point cloud. This representation factors each image into viewpoint, appearance conditions, and transient objects.
- Rerendering Framework: A neural rerendering network synthesizes new images of the scene from a deferred-shading deep buffer rendered from the point cloud, conditioned on a latent appearance vector, so the same viewpoint can be rendered under many different appearances (see the sketch after this list).
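To make the data flow concrete, here is a minimal PyTorch sketch of the conditioning scheme: a deep buffer rendered from the point cloud is concatenated with a spatially broadcast appearance vector and fed to a convolutional network. The class name `RerenderingNet`, the channel layout, and the tiny backbone are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RerenderingNet(nn.Module):
    """Toy stand-in for the rerendering network (not the paper's architecture)."""
    def __init__(self, buffer_channels: int, appearance_dim: int):
        super().__init__()
        # The latent appearance vector is broadcast spatially and concatenated
        # with the deep buffer before a convolutional backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(buffer_channels + appearance_dim, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),  # RGB output
        )

    def forward(self, deep_buffer: torch.Tensor, appearance: torch.Tensor):
        b, _, h, w = deep_buffer.shape
        z = appearance.view(b, -1, 1, 1).expand(b, appearance.shape[1], h, w)
        return self.backbone(torch.cat([deep_buffer, z], dim=1))

# Deep buffer rendered from the SfM/MVS point cloud for one viewpoint,
# e.g. 3 albedo channels + 1 depth channel (the channel choice is an assumption).
deep_buffer = torch.randn(1, 4, 256, 256)
appearance = torch.randn(1, 8)  # latent appearance code
image = RerenderingNet(4, 8)(deep_buffer, appearance)  # -> (1, 3, 256, 256)
```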
The model is trained in stages, including a pretraining step for the appearance encoder on a proxy task: the encoder learns to embed appearance variation using a style-based loss, independent of viewpoint content, which improves the model's ability to handle complex appearance variability (a style-loss sketch follows).
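As a rough illustration of what "style-based" means here, the sketch below computes a Gram-matrix style distance over feature maps, a statistic that discards spatial layout and is therefore largely insensitive to viewpoint. The feature extractor and the exact training objective are assumptions; this only approximates the paper's proxy task.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (B, C, H, W) feature map: captures style statistics
    while discarding spatial layout, and hence viewpoint."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_distance(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Viewpoint-insensitive distance between two images' feature maps,
    e.g. taken from a pretrained VGG layer (feature extractor assumed)."""
    return (gram_matrix(feat_a) - gram_matrix(feat_b)).pow(2).mean()

# An appearance encoder can then be pretrained so that distances between its
# latent codes mirror these style distances, independent of scene content.
```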
Experimental Results
The framework is evaluated on multiple datasets of globally recognized landmarks captured under diverse illumination conditions. The authors demonstrate that the model can both interpolate and extrapolate scene appearances, and they quantify this against existing methods, reporting improvements in rendering fidelity and appearance versatility.
The paper highlights strong results, with view and appearance interpolation shown in video demonstrations (a toy interpolation routine is sketched below). It contrasts the output with previous models, which largely assumed static illumination, arguing that the proposed approach yields more realistic scene capture and dynamic rendition.
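Appearance interpolation falls out of the latent conditioning almost for free: blend two appearance codes and rerender the same viewpoint. A minimal sketch, reusing the toy `RerenderingNet` from above (all names are illustrative):

```python
import torch

def interpolate_appearance(model, deep_buffer, z_a, z_b, steps=5):
    """Render one viewpoint under a sweep of blended appearance codes."""
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b  # linear blend in latent appearance space
        frames.append(model(deep_buffer, z))
    return frames  # e.g. a day-to-night transition for a fixed camera
```

Extrapolation is analogous in spirit: sampling appearance codes beyond those observed in the photo collection, though how far this remains plausible depends on the learned latent space.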
Practical and Theoretical Implications
Practically, the model holds significant potential for applications in virtual tourism, gaming, and augmented reality, where realistic scene portrayal under varying conditions is paramount. Theoretically, the work challenges assumed limits on using uncontrolled datasets for detailed reconstruction, showing that neural networks can distill substantial information even from noisy, incomplete data.
Future Developments
The research suggests several future avenues, including refining semantic labeling to mitigate the current limitations related to segmentation inaccuracies. Moreover, improving the temporal coherence of renderings can advance applications requiring smooth visual transitions. Further exploration of latent spaces could unlock even more intricate appearance interpolation capabilities.
In conclusion, the paper "Neural Rerendering in the Wild" provides a robust methodology to tackle the complexities of total scene capture using the abundant, though irregular, data represented by community photos. The innovative use of neural networks to produce high-fidelity scene renderings from in-the-wild images is a strong contribution to the field of computer vision and graphics.