
SPAD: Spatially Aware Multiview Diffusers

(2402.05235)
Published Feb 7, 2024 in cs.CV

Abstract

We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high-quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plücker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason well over spatial proximity in 3D. In contrast to recent works that can only generate views at fixed azimuth and elevation, SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue. See more details at our webpage: https://yashkant.github.io/spad

SPAD synthesizes 3D-consistent images from text, rendered from different camera angles, using minimal training views.

Overview

  • SPAD introduces a method to synthesize consistent multi-view images using diffusion models, enhancing 3D image generation by incorporating cross-view interactions and spatial coordinates.

  • The introduction of Epipolar Attention refines the model's understanding of spatial relationships, improving camera control and 3D consistency across generated multi-view images.

  • Plücker Embeddings are integrated to allow precise maintenance of object positions and orientations, solving previous issues like view flipping in multi-view image generation.

  • SPAD's ability to generate high-quality, 3D-consistent images from textual and image inputs has significant implications for virtual reality, gaming, and 3D modeling, showcasing superior performance in evaluations.

Introduction

In the age of rapidly advancing generative models, techniques capable of understanding and interpreting three-dimensional structure from textual or image inputs have gained paramount importance. The paper "SPAD: Spatially Aware Multi-View Diffusers" introduces a method that leverages advances in diffusion models (DMs) to synthesize consistent multi-view images. By repurposing and extending pre-trained 2D DMs, specifically by modifying their self-attention layers to incorporate cross-view interactions and by employing both epipolar geometry and Plücker coordinates, SPAD achieves significant gains in the fidelity and 3D consistency of generated images.

Methodology

Epipolar Attention for Enhanced Camera Control

At the heart of SPAD lies Epipolar Attention, a technique designed to refine the model's understanding of spatial relationships between multi-view images. By constraining feature-map positions to attend across views only along their epipolar lines, SPAD significantly reduces the content-copying problem faced by previous models. This enhancement not only bolsters camera control, allowing images to be generated from novel and diverse viewpoints, but also markedly improves the 3D consistency of the generated outputs.
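To make the constraint concrete, the following is a minimal sketch of how an epipolar attention mask could be built from known camera intrinsics and a relative pose between two views. The function names, the pixel-distance threshold, and the assumption that both feature maps share the same resolution are illustrative choices, not details taken from the paper's implementation.

```python
import torch

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x @ v = t x v."""
    tx, ty, tz = float(t[0]), float(t[1]), float(t[2])
    return torch.tensor([[0.0,  -tz,   ty],
                         [ tz,  0.0,  -tx],
                         [-ty,   tx,  0.0]])

def epipolar_attention_mask(K1, K2, R, t, H, W, thresh=1.0):
    """Boolean (H*W, H*W) mask: position i in view 1 may attend to position j
    in view 2 only if j lies within `thresh` pixels of i's epipolar line.
    R, t map view-1 camera coordinates into view-2 camera coordinates."""
    # Fundamental matrix from the relative pose and intrinsics: F = K2^-T [t]_x R K1^-1.
    F = torch.linalg.inv(K2).T @ skew(t) @ R @ torch.linalg.inv(K1)

    # Homogeneous pixel grid (shared by both feature maps here).
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs.flatten(), ys.flatten(), torch.ones(H * W)], dim=0)  # (3, HW)

    # Epipolar line in view 2 for every pixel of view 1: l_i = F @ p_i.
    lines = F @ pix                                   # (3, HW)

    # Point-to-line distance |l_i . p_j| / ||(a_i, b_i)|| for every pair (i, j).
    num = (lines.T @ pix).abs()                       # (HW, HW)
    denom = lines[:2].norm(dim=0, keepdim=True).T     # (HW, 1)
    dist = num / (denom + 1e-8)
    return dist < thresh                              # True where attention is allowed
```

In the model, such a mask would be applied to the cross-view attention logits, for example by setting disallowed query-key pairs to negative infinity before the softmax.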

Plücker Embeddings for Spatial Reasoning

Building upon the epipolar-geometry constraint, SPAD further enhances its spatial awareness by incorporating Plücker coordinates as positional embeddings within its architecture. This adaptation allows the model to maintain consistent object positions and orientations across varying camera views, effectively mitigating issues such as view flipping that affected earlier approaches.
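As an illustration of this positional encoding, the sketch below computes per-pixel Plücker coordinates (ray direction plus moment, i.e. origin cross direction) for a pinhole camera. It assumes a standard camera-to-world pose convention and OpenCV-style intrinsics; it is a generic formulation rather than the paper's code.

```python
import torch

def plucker_ray_embedding(K, c2w, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.
    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns an (H, W, 6) embedding that can be injected into the
    diffusion U-Net as extra positional features."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32) + 0.5,
                            torch.arange(W, dtype=torch.float32) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)    # (H, W, 3)

    # Ray directions in camera space, rotated into world space and normalized.
    dirs = pix @ torch.linalg.inv(K).T                          # (H, W, 3)
    dirs = dirs @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)

    # All rays share the camera origin o; the moment is m = o x d.
    origin = c2w[:3, 3].expand_as(dirs)                         # (H, W, 3)
    moment = torch.cross(origin, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1)                    # (H, W, 6)
```

Because the moment o × d is the same for any point on the ray, the resulting 6-vector identifies the ray itself, which is what lets the model compare spatial locations across different cameras.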

Performance and Evaluation

Extensive experiments demonstrate SPAD's superiority in generating high-quality, 3D-consistent multi-view images from both textual and image-based inputs. The model reliably interprets relative camera poses, translating them into spatially coherent images that closely match the specified viewpoints. Notably, SPAD achieves this without compromising individual image quality, maintaining strong PSNR, SSIM, and LPIPS scores across evaluations.
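For reference, these image-quality metrics are commonly computed with torchmetrics; the snippet below is a generic evaluation sketch (random tensors stand in for generated and ground-truth views), not the paper's evaluation harness. LPIPS requires the torchmetrics image extras (e.g. pip install "torchmetrics[image]").

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# pred and target: (N, 3, H, W) images in [0, 1], e.g. generated vs. ground-truth views.
pred = torch.rand(4, 3, 256, 256)
target = torch.rand(4, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)(pred, target)               # higher is better
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(pred, target)   # higher is better
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)(pred, target)  # lower is better

print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}  LPIPS: {lpips:.3f}")
```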

Applications and Implications

The implications of SPAD's innovations extend well beyond the immediate realm of image generation. Its capability to create accurate, high-fidelity multi-view images from sparse inputs opens new avenues in virtual reality, gaming, and 3D modeling, significantly reducing the resource and time expenditure typically associated with these tasks. Furthermore, SPAD's text-to-3D generation capabilities, demonstrated through both multi-view Score Distillation Sampling (SDS) and a triplane generator, exemplify its potential to streamline the content creation pipeline, enabling rapid generation of complex 3D assets directly from textual descriptions.
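For context, Score Distillation Sampling (introduced by DreamFusion) optimizes a 3D representation θ by backpropagating a diffusion model's denoising residual through a differentiable renderer; SPAD's multi-view variant supplies that residual from its multi-view denoiser for several cameras at once. The standard SDS gradient has the form below, given here as background rather than as the paper's exact objective.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon,\,c}\!\left[ w(t)\,
      \big(\epsilon_\phi(x_t;\, y,\, c,\, t) - \epsilon\big)\,
      \frac{\partial x}{\partial \theta} \right]
```

Here x is the rendering of θ from camera c, x_t its noised version at timestep t, y the text prompt, w(t) a weighting schedule, and ε_φ the diffusion model's noise prediction.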

Concluding Thoughts

SPAD represents a significant leap forward in the automation and enhancement of 3D content generation. By equipping diffusion models with an acute spatial awareness and a deeper understanding of the geometry that defines our visual world, it sets a new benchmark for future research in the field. While acknowledging its current limitations, the promising direction and robust performance of SPAD pave the way for further exploration and advancement in generative AI, bringing us closer to capturing the richness and complexity of the three-dimensional world.
