- The paper introduces a deep learning model that predicts a multiplane image (MPI) from a single RGB input, extending MPI-based view synthesis beyond settings that require multiple views.
- Training combines a scale-invariant view-synthesis objective with sparse depth supervision and an edge-aware smoothness loss, producing sharp depth maps and plausible handling of occluded content.
- Quantitative evaluations on datasets such as RealEstate10K and KITTI show improved synthesis quality as measured by LPIPS, PSNR, and SSIM.
Evaluation of Single-View View Synthesis with Multiplane Images
The paper, "Single-View View Synthesis with Multiplane Images," presents a novel approach to view synthesis from a single RGB image, utilizing multiplane image (MPI) representation. The work is rooted in recent advancements in deep learning that have successfully employed MPIs for view synthesis tasks involving multiple images. The authors extend this approach to the more challenging single-view scenario, which necessitates predicting depth and handling occlusions without the multiview advantages of stereo or multiple camera systems.
Methodological Innovations
The core contribution is a deep learning model that generates an MPI from a single input image. Unlike prior methods, which estimate MPIs from multiple views, this work addresses the single-view case directly. The authors further introduce a scale-invariant formulation of view synthesis that tackles the global scale ambiguity inherent in single-image input: sparse point sets recovered during training data generation are used to resolve scale, so the model need not commit to an absolute scale and remains robust under this ambiguity.
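One way to realize this scale handling, shown as a hedged sketch rather than the authors' exact formulation, is to estimate a single scalar that best aligns the disparity implied by the predicted MPI with the sparse points visible in the source image; that scalar can then rescale either the predicted geometry or the relative camera translation before rendering the target view. The helper name and tensor shapes below are assumptions for illustration.

```python
import torch

def scale_factor_from_sparse_points(pred_disparity, sparse_uv, sparse_depth):
    """Estimate a per-example scalar aligning predicted disparity with sparse
    SfM depths (a sketch; the paper's exact alignment may differ).

    pred_disparity: (H, W) disparity (inverse depth) derived from the MPI.
    sparse_uv:      (N, 2) integer pixel coordinates of the sparse points.
    sparse_depth:   (N,) depths of those points in the SfM reconstruction's scale.
    """
    pred = pred_disparity[sparse_uv[:, 1], sparse_uv[:, 0]]    # sample at points
    target = 1.0 / sparse_depth.clamp(min=1e-6)                # sparse disparities
    # Robust ratio estimate: median of per-point log ratios.
    log_ratio = torch.log(target) - torch.log(pred.clamp(min=1e-6))
    return torch.exp(torch.median(log_ratio))
```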
The model also incorporates an edge-aware smoothness loss to refine the depth map derived from the predicted MPI. This loss discourages smoothing across visible object boundaries, keeping depth discontinuities sharp and aligned with image edges. Sparse depth supervision adds a further source of accuracy to the depth prediction without requiring dense ground-truth depth.
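A common form of edge-aware smoothness, used widely in monocular depth estimation, penalizes disparity gradients except where the input image itself has strong gradients. The sketch below follows that standard formulation; the paper's exact weighting and normalization may differ.

```python
import torch

def edge_aware_smoothness(disparity, image):
    """Edge-aware smoothness penalty in the style common to monocular depth
    methods (a sketch; not necessarily the paper's exact term).

    disparity: (B, 1, H, W) disparity derived from the MPI.
    image:     (B, 3, H, W) source RGB image with values in [0, 1].
    """
    dx_d = (disparity[..., :, 1:] - disparity[..., :, :-1]).abs()
    dy_d = (disparity[..., 1:, :] - disparity[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    # Down-weight the disparity-gradient penalty wherever the image has an edge,
    # so depth discontinuities are allowed to coincide with object boundaries.
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```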
Quantitative and Qualitative Outcomes
The authors validate their approach on datasets including RealEstate10K and KITTI. The evaluation reports LPIPS, PSNR, and SSIM, comparing against several baselines and alternative methods and consistently showing favorable results. Notably, the model synthesizes plausible new views and maintains perceptual quality even under substantial camera motion between source and target frames. It also performs well across different datasets and conditions, supporting the general applicability of the proposed approach.
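For readers unfamiliar with these metrics, the snippet below shows how such a per-pair evaluation is typically computed with common open-source packages (scikit-image for PSNR/SSIM and the `lpips` package for LPIPS). It is a generic sketch of the evaluation protocol, not the authors' evaluation code, and the helper `evaluate_pair` is hypothetical.

```python
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, target, lpips_fn):
    """Compute PSNR, SSIM, and LPIPS for one predicted/ground-truth view pair.

    pred, target: float numpy arrays of shape (H, W, 3) with values in [0, 1].
    """
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(target)).item()
    return psnr, ssim, lp

lpips_fn = lpips.LPIPS(net='alex')  # AlexNet-backed LPIPS, a common choice
```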
On the iBims-1 benchmark for depth estimation, the model achieves performance levels comparable to those of methods heavily relying on explicit depth training data. This underscores the effectiveness of their sparse depth and scale-invariant approach. Importantly, the model's ability to predict and utilize background content is evidenced by improved results in areas of disocclusion, a common challenge in view synthesis tasks.
Implications and Future Directions
The demonstrated capabilities of single-view MPI generation suggest potential practical applications in scenarios where multiple view inputs are infeasible, such as handheld device photography or video applications. By relaxing the requirement for multiple input views, the approach opens up new avenues for computational photography and AR/VR applications where dynamic single-view reconstructions are desirable.
The paper also points to future research directions, particularly extending the MPI's ability to inpaint more extensive content behind occlusions. Combining the method with adversarial loss functions remains an enticing prospect, possibly leading to even more realistic single-view synthesis results.
In conclusion, this research represents a significant step towards understanding and expanding the capabilities of view synthesis from single images, providing both a technical framework and a prospective pathway for future explorations in computational imaging and related fields.