- The paper introduces a CNN method that predicts appearance flow fields to synthesize new views without generating pixels from scratch.
- It leverages an encoder-decoder architecture that combines input view features with viewpoint transformations to guide pixel mapping.
- Experimental results on the ShapeNet and KITTI datasets show lower mean pixel L1 error and higher perceptual quality compared to prior CNN-based methods.
View Synthesis by Appearance Flow
This paper presents an approach to novel view synthesis, the problem of rendering an object or scene from an arbitrary new viewpoint given a single input image. Rather than generating the pixels of the target view from scratch, the authors train a Convolutional Neural Network (CNN) to predict appearance flows: 2-D coordinate vectors that specify, for each pixel in the target view, where in the input view it should be sampled from. This formulation exploits the strong correlation between different views of the same object and produces synthesized views of higher perceptual quality than CNN methods that regress pixel values directly.
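To make the sampling mechanism concrete, the following is a minimal PyTorch sketch (not the authors' released code) of how a predicted appearance flow field warps an input view into a target view via differentiable bilinear sampling; the tensor names, shapes, and the use of `F.grid_sample` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp_with_appearance_flow(input_view: torch.Tensor,
                              flow: torch.Tensor) -> torch.Tensor:
    """Reconstruct the target view by sampling input-view pixels.

    input_view: (N, C, H, W) source image batch.
    flow:       (N, 2, H, W) predicted sampling coordinates for each
                target pixel, normalized to [-1, 1] in (x, y) order,
                as expected by F.grid_sample.
    """
    # grid_sample expects the sampling grid as (N, H, W, 2).
    grid = flow.permute(0, 2, 3, 1)
    # Bilinear sampling is differentiable, so a flow predictor can be
    # trained end-to-end with a pixel reconstruction loss on the output.
    return F.grid_sample(input_view, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)


# Illustrative usage with random tensors standing in for a real image
# and a network-predicted flow field.
img = torch.rand(1, 3, 224, 224)
flow = torch.rand(1, 2, 224, 224) * 2 - 1  # normalized coordinates
target = warp_with_appearance_flow(img, flow)
```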
Methodology
The core idea is to predict an appearance flow field instead of regressing raw pixel values for the novel view. Specifically, the network predicts, for each pixel in the target view, where it should come from in the input view; the target view is then reconstructed by bilinearly sampling the input image at the predicted locations, so the whole pipeline is differentiable and can be trained end-to-end with a reconstruction loss. The model is a deep convolutional encoder-decoder with three components (a simplified sketch follows the list below):
- Input View Encoder: Encodes the input image into a latent feature representation of its appearance (color, texture) and implicit pose.
- Viewpoint Transformation Encoder: Maps the desired relative viewpoint transformation (e.g., a change in azimuth) to a higher-dimensional feature vector.
- Synthesis Decoder: Combines the two feature representations and predicts the appearance flow field, which is used to reconstruct the target view from input-view pixels.
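As referenced above, here is a simplified PyTorch sketch of such an encoder-decoder; the layer counts, channel widths, and the viewpoint encoding dimensionality are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AppearanceFlowNet(nn.Module):
    """Simplified encoder-decoder that predicts an appearance flow field.

    Layer counts and channel widths are illustrative only.
    """

    def __init__(self, viewpoint_dim: int = 17):
        super().__init__()
        # Input view encoder: image -> latent feature vector.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Viewpoint transformation encoder: relative pose -> feature vector.
        self.view_encoder = nn.Sequential(
            nn.Linear(viewpoint_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # Synthesis decoder: fused features -> 2-channel flow field of
        # normalized sampling coordinates, bounded to [-1, 1] by tanh.
        self.decoder = nn.Sequential(
            nn.Linear(128 + 128, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, input_view, viewpoint):
        img_feat = self.image_encoder(input_view)    # (N, 128)
        view_feat = self.view_encoder(viewpoint)     # (N, 128)
        fused = torch.cat([img_feat, view_feat], dim=1)
        return self.decoder(fused)                   # (N, 2, 64, 64) flow
```

The predicted flow field would then be fed to a bilinear warping step like the one sketched earlier to produce the output image.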
The method is extended to accommodate multiple input views: the network predicts a flow field and a per-pixel confidence map for each view, and the final synthesis is a confidence-weighted combination of the individual per-view predictions, as sketched below.
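A minimal sketch of such a confidence-weighted combination, assuming each input view has already been warped to the target viewpoint and that the confidences are predicted as unnormalized per-pixel scores:

```python
import torch

def combine_views(warped_views: torch.Tensor,
                  confidences: torch.Tensor) -> torch.Tensor:
    """Fuse per-view predictions with per-pixel confidence weighting.

    warped_views: (V, N, C, H, W) target-view reconstructions, one per
                  input view, each produced by flow-based warping.
    confidences:  (V, N, 1, H, W) unnormalized per-pixel confidence
                  scores predicted alongside each flow field.
    """
    # Normalize confidences across the view dimension so the weights
    # at every pixel sum to one.
    weights = torch.softmax(confidences, dim=0)
    # Weighted sum of the individual per-view reconstructions.
    return (weights * warped_views).sum(dim=0)
```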
Experimental Results
The authors evaluate their approach on synthetic objects (ShapeNet) and real-world scenes (the KITTI dataset), showing clear improvements over prior pixel-generation methods, particularly in preserving high-frequency details such as textures and edge boundaries. Quantitatively, the model achieves lower mean pixel L1 error across the benchmarks, and in perceptual studies human annotators preferred its results over the pixel-generation baseline in the large majority of cases.
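For reference, the mean pixel L1 error reduces to the average absolute difference between the predicted and ground-truth images; a minimal sketch, assuming image tensors with values in [0, 1]:

```python
import torch

def mean_pixel_l1(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean absolute per-pixel error, averaged over pixels and channels."""
    return (pred - target).abs().mean()
```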
Implications and Future Directions
The approach is directly applicable to domains such as computer graphics and virtual reality, where high-fidelity image synthesis from diverse viewpoints is crucial. By reusing input-view pixels through appearance flow, the method preserves detail and structural integrity in the synthesized views.
However, the technique can only reuse pixel information present in the input, which limits its ability to hallucinate content that is occluded or otherwise unseen. Future research could combine flow-based warping with direct pixel generation to balance detail preservation with the ability to fill in missing regions, and could improve the model's capacity to capture long-range dependencies in the predicted flow.
The authors also highlight the need for standardized datasets and evaluation metrics for view synthesis, which would provide a firmer basis for comparing different methods.
Overall, this paper contributes a significant methodological innovation in view synthesis by reimagining the pixel generation challenge as an appearance flow prediction task, yielding results that substantially advance the field's state of the art.