- The paper introduces a novel multi-view conditioned diffusion model that synthesizes coherent target views with precise camera control.
- It leverages a transformer architecture with Camera Positional Encoding (CaPE) to encode both 4-DoF (object-centric) and 6-DoF camera poses.
- Empirical results demonstrate that EscherNet outperforms existing 3D diffusion models in generating consistent views from sparse inputs.
Overview
EscherNet introduces a multi-view conditioned diffusion model for view synthesis. By handling an arbitrary number of reference views and generating multiple consistent target views with precise camera control, the model marks a significant advance in 3D vision. Notably versatile, EscherNet can operate with as few as a single reference view or scale up to many, generating over 100 target views simultaneously. This flexibility and scalability challenge the status quo of scene-specific, coordinate-dependent approaches.
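To make that input/output flexibility concrete, here is a minimal interface sketch in Python. The function name `synthesize_views`, its arguments, and the image shapes are hypothetical illustrations rather than EscherNet's released API; only the N-references-in, M-targets-out contract comes from the paper.

```python
import numpy as np

def synthesize_views(ref_images, ref_poses, tgt_poses, steps=50):
    """Jointly generate M target views conditioned on N posed reference views.

    Placeholder sketch: a real model would run M latents through a diffusion
    sampler in which every target attends to all references and to the other
    targets, which is what keeps the generated set mutually consistent.
    """
    assert len(ref_images) == len(ref_poses)
    M = len(tgt_poses)
    return [np.zeros((64, 64, 3)) for _ in range(M)]  # dummy outputs

# Works with a single reference view...
views = synthesize_views(ref_images=[np.zeros((64, 64, 3))],
                         ref_poses=[np.eye(4)],
                         tgt_poses=[np.eye(4)] * 100)  # ...and 100+ targets
print(len(views))  # 100
```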
Architecture and Methodology
At its core, EscherNet employs an image-to-image conditional diffusion architecture built on transformer-style attention, using self-attention among target views and cross-attention from target to reference views to maintain consistency. Its pivotal innovation is Camera Positional Encoding (CaPE), which captures both 4-DoF (object-centric) and 6-DoF camera poses. CaPE lets relative camera transformations enter directly through dot-product attention, keeping the model independent of any specific global coordinate system.
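The property CaPE needs is that attention scores depend only on the relative transform between two cameras. The sketch below shows one way to obtain that property for 6-DoF poses: features are reshaped into homogeneous 4-vectors and transformed by each view's 4x4 pose matrix before the dot product. This is an illustrative construction under those assumptions, not necessarily the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_query(q, pose):
    # View a (d,) feature as d/4 homogeneous 4-vectors, each mapped by P^{-T}.
    return (np.linalg.inv(pose).T @ q.reshape(-1, 4).T).T.reshape(-1)

def encode_key(k, pose):
    # Keys are mapped by P, so <enc(q), enc(k)> = sum_i q_i^T (P_q^{-1} P_k) k_i,
    # a function of the relative pose only.
    return (pose @ k.reshape(-1, 4).T).T.reshape(-1)

def random_pose(rng):
    # Random rotation (via QR) plus translation, packed into a 4x4 matrix.
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    P = np.eye(4)
    P[:3, :3], P[:3, 3] = Q, rng.normal(size=3)
    return P

d = 16
q, k = rng.normal(size=d), rng.normal(size=d)
P_q, P_k = random_pose(rng), random_pose(rng)

# Move BOTH cameras by the same world transform G: the score is unchanged,
# since (G P_q)^{-1} (G P_k) = P_q^{-1} P_k.
G = random_pose(rng)
s1 = encode_query(q, P_q) @ encode_key(k, P_k)
s2 = encode_query(q, G @ P_q) @ encode_key(k, G @ P_k)
print(np.allclose(s1, s2))  # True: attention sees only relative pose
```

The printed `True` confirms the invariance: a global change of coordinate frame leaves every attention score intact, which is exactly what frees the model from any specific coordinate system.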
Empirical Evaluation
Extensive experiments show EscherNet producing state-of-the-art results across diverse benchmarks. Compared against prominent scene-specific neural rendering methods and existing 3D diffusion models, EscherNet not only achieves higher generative quality but also infers plausible novel views from sparse inputs, a capability that matters for real-world applications where detailed 3D ground truth is often unavailable. It also outperforms competing models on single- and few-image 3D reconstruction, suggesting it captures 3D structure effectively from limited data.
Conclusion and Future Prospects
EscherNet’s strengths lie in its generality and its ability to generate coherent views in a zero-shot setting, opening new directions for neural architecture design, particularly in 3D vision. Because it can learn from everyday posed 2D images rather than specialized 3D datasets, it is well placed to make scalable view synthesis broadly accessible. The insights behind EscherNet hold promising implications for future models that may redefine the scope of multi-view synthesis and reconstruction.