Taming Stable Diffusion for Text to 360° Panorama Image Generation (2404.07949v1)

Published 11 Apr 2024 in cs.CV

Abstract: Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.

References (67)
  1. Diverse plausible 360-degree image outpainting for efficient 3DCG background creation. In CVPR, pages 11441–11450, 2022.
  2. MultiDiffusion: Fusing diffusion paths for controlled image generation. In ICML, pages 1737–1752. PMLR, 2023.
  3. Matterport3D: Learning from RGB-D data in indoor environments. 3DV, 2017.
  4. Text2Light: Zero-shot text-driven HDR panorama generation. ACM TOG, 41(6):1–16, 2022.
  5. Guided co-modulated GAN for 360° field of view extrapolation. In 3DV, pages 475–485. IEEE, 2022.
  6. Diffusion models beat GANs on image synthesis. In NeurIPS, pages 8780–8794, 2021.
  7. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021.
  8. Ctrl-Room: Controllable text-to-3D room meshes generation with layout constraints. arXiv preprint arXiv:2310.03602, 2023.
  9. SceneScape: Text-driven consistent scene generation. arXiv preprint arXiv:2302.01133, 2023.
  10. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  11. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.
  12. Denoising diffusion probabilistic models. In NeurIPS, pages 6840–6851, 2020.
  13. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In ICCV, pages 7909–7920, 2023.
  14. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  15. Elucidating the design space of diffusion-based generative models. In NeurIPS, pages 26565–26577, 2022.
  16. SyncDiffusion: Coherent montage via synchronized joint diffusions. In NeurIPS, 2023.
  17. PanoGen: Text-conditioned panoramic environment generation for vision-and-language navigation. In NeurIPS, 2023.
  18. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
  19. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, pages 5775–5787, 2022.
  20. Autoregressive omni-aware outpainting for open-vocabulary 360-degree image generation. arXiv preprint arXiv:2309.03467, 2023.
  21. RePaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
  22. Guided image synthesis via initial image editing in diffusion model. In ACM MM, pages 5321–5329. ACM, 2023.
  23. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421. Springer, 2020.
  24. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  25. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pages 16784–16804. PMLR, 2022.
  26. BIPS: Bi-modal indoor panorama synthesis via residual depth-aided adversarial learning. In ECCV, pages 352–371. Springer, 2022.
  27. High-resolution depth estimation for 360° panoramas through perspective and panoramic depth images registration. In WACV, pages 3116–3125, 2023.
  28. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  29. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  30. 360MonoDepth: High-resolution 360° monocular depth estimation. In CVPR, pages 3762–3772, 2022.
  31. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  32. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  33. DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
  34. RunwayML. Stable Diffusion. https://github.com/runwayml/stable-diffusion, 2021.
  35. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022.
  36. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, pages 36479–36494, 2022.
  37. Improved techniques for training GANs. In NeurIPS, pages 2226–2234, 2016.
  38. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  39. LAION-5B: an open large-scale dataset for training next generation image-text models. In NeurIPS, pages 25278–25294, 2022.
  40. Conditional 360-degree image synthesis for immersive indoor scene decoration. In ICCV, pages 4478–4488, 2023.
  41. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  42. Denoising diffusion implicit models. In ICLR, 2021.
  43. RoomDreamer: Text-driven 3D indoor scene synthesis with coherent geometry and texture. In ACM MM, pages 6898–6906. ACM, 2023.
  44. Generative modeling by estimating gradients of the data distribution. In NeurIPS, pages 11895–11907, 2019.
  45. HorizonNet: Learning room layout with 1D representation and pano stretch data augmentation. In CVPR, pages 1047–1056, 2019.
  46. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  47. MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. In NeurIPS, 2023.
  48. Consistent view synthesis with pose-guided diffusion models. In CVPR, pages 16773–16783, 2023.
  49. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  50. LayoutMP3D: Layout annotation of Matterport3D. arXiv preprint arXiv:2003.13516, 2020.
  51. StyleLight: HDR panorama generation for lighting estimation and editing. In ECCV, pages 477–492. Springer, 2022.
  52. PERF: Panoramic neural radiance field from a single panorama. arXiv preprint arXiv:2310.16831, 2023.
  53. Customizing 360-degree panoramas through text-to-image diffusion models. In WACV, 2024.
  54. 360-degree panorama generation from few unregistered NFoV images. In ACM MM, pages 6811–6821. ACM, 2023.
  55. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In NeurIPS, 2023.
  56. IPO-LDM: Depth-aided 360-degree indoor RGB panorama outpainting via latent diffusion model. arXiv preprint arXiv:2307.03177, 2023.
  57. Layout-guided novel view synthesis from a single indoor panorama. In CVPR, pages 16438–16447, 2021.
  58. State-of-the-art in 360 video/image processing: Perception, assessment and compression. JSTSP, 14(1):5–26, 2020.
  59. A survey of scene understanding by event reasoning in autonomous driving. MIR, 15(3):249–266, 2018.
  60. Neural rendering in a room: Amodal 3D understanding and free-viewpoint rendering for the closed scene composed of pre-captured objects. ACM TOG, 41(4):1–10, 2022.
  61. DreamSpace: Dreaming your room space with text-driven panoramic texture propagation. arXiv preprint arXiv:2310.13119, 2023.
  62. Diffusion models: A comprehensive survey of methods and applications. ACM CSUR, 2022.
  63. Long-term photometric consistent novel view synthesis with diffusion models. In ICCV, 2023.
  64. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.
  65. DiffCollage: Parallel generation of large content with diffusion models. In CVPR, 2023.
  66. ACDNet: Adaptively combined dilated convolution for monocular panorama depth estimation. In AAAI, pages 3653–3661, 2022.
  67. Manhattan room layout reconstruction from a single 360 image: A comparative study of state-of-the-art methods. IJCV, 129:1410–1431, 2021.
Authors (7)
  1. Cheng Zhang (389 papers)
  2. Qianyi Wu (29 papers)
  3. Camilo Cruz Gambardella (9 papers)
  4. Xiaoshui Huang (55 papers)
  5. Dinh Phung (148 papers)
  6. Wanli Ouyang (359 papers)
  7. Jianfei Cai (163 papers)
Citations (4)

Summary

  • The paper introduces PanFusion, a dual-branch diffusion model that overcomes data scarcity and geometric gaps to generate coherent 360° panoramic images from text.
  • It employs a novel Equirectangular-Perspective Projection Attention mechanism that fuses global panoramic and local perspective views for enhanced visual realism.
  • Robust training on minimal paired data, aided by LoRA fine-tuning, yields significant gains on performance metrics, with promising applications in AR/VR, architectural visualization, and beyond.

A Review and Analysis of "Taming Stable Diffusion for Text to 360° Panorama Image Generation"

The paper "Taming Stable Diffusion for Text to 360° Panorama Image Generation" by Cheng Zhang et al. explores a novel approach for generating 360-degree panoramic images from textual prompts, an area that presents significant computational challenges due to data scarcity and the complex geometric transformations involved. The paper introduces a dual-branch diffusion model named PanFusion to improve the quality and consistency of panoramic image generation using the capabilities of generative models like Stable Diffusion.

Key Contributions

The paper identifies two primary obstacles in generating panoramic images: limited paired data for text-to-panorama generation and the geometric domain gap between panoramic and traditional perspective images. To address these, the authors propose a dual-branch architecture consisting of a global panorama branch and a local perspective branch. This setup aims to exploit the rich detail of perspective imagery while maintaining the global coherence necessary for panoramas.

  1. Dual-Branch Diffusion Architecture: PanFusion integrates two branches, each fine-tuned for its specific strengths; one maintains the panoramic "canvas" while the other focuses on multiview perspective images. This division allows the model to harness Stable Diffusion's prior knowledge in the perspective domain and adapt it for panoramic generation.
  2. Equirectangular-Perspective Projection Attention (EPPA): This novel attention mechanism maintains geometric integrity while coordinating the global and local branches. It accounts for the equirectangular geometry of panoramic imagery when exchanging information between branches, ensuring consistency across different views (the projection mapping at its core is sketched after this list).
  3. Robust Training and Fine-Tuning: By adapting pretrained models with minimal data, using LoRA for parameter-efficient fine-tuning, the paper demonstrates an approach that conservatively extends existing models to data-scarce conditions (a minimal LoRA layer is also sketched below).
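
To make the cross-branch correspondence concrete, the sketch below computes the equirectangular coordinates that each pixel of a perspective view projects to. A projection-aware attention mechanism such as EPPA can use exactly this kind of mapping to align perspective tokens with panorama tokens; the code is an illustrative reconstruction under that assumption, not the authors' implementation, and all names are ours.

```python
# Map each pixel of a perspective view to (u, v) coordinates on an
# equirectangular panorama. Hypothetical helper for illustration only.
import numpy as np

def perspective_to_equirect(h, w, fov_deg, yaw_deg, pitch_deg):
    """Return (u, v) in [0, 1) on the panorama for every pixel of an
    h x w perspective view with the given field of view and orientation."""
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)      # focal length (pixels)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Camera-space ray directions: x right, y down, z forward.
    dirs = np.stack([(xs - w / 2) / f, (ys - h / 2) / f,
                     np.ones((h, w))], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (about x), then yaw (about y).
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p), np.cos(p)]])
    ry = np.array([[np.cos(t), 0, np.sin(t)],
                   [0, 1, 0],
                   [-np.sin(t), 0, np.cos(t)]])
    dirs = dirs @ rx.T @ ry.T

    lon = np.arctan2(dirs[..., 0], dirs[..., 2])       # longitude in [-pi, pi)
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))      # latitude in [-pi/2, pi/2]
    return lon / (2 * np.pi) + 0.5, lat / np.pi + 0.5

u, v = perspective_to_equirect(64, 64, fov_deg=90, yaw_deg=45, pitch_deg=0)
print(u.shape, float(u.min()), float(u.max()))         # panorama span covered
```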
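
For the fine-tuning contribution, the following is a minimal, generic LoRA layer in PyTorch. It is a sketch of the standard technique, not the paper's code: the rank, scaling, and the decision to wrap a bare nn.Linear are illustrative assumptions, and in practice the adapters would be injected into the diffusion branches' attention projections.

```python
# Generic LoRA adapter: freeze the pretrained weight W and learn a
# low-rank update B @ A, training only r * (d_in + d_out) parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pretrained path frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r             # common LoRA scaling convention

    def forward(self, x):
        # Frozen output plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(320, 320))
print(layer(torch.randn(2, 77, 320)).shape)  # torch.Size([2, 77, 320])
```

Because lora_b starts at zero, the adapted layer initially reproduces the pretrained model exactly, which is what makes this kind of conservative extension safe under data-scarce conditions.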

Experimental Evaluation and Implications

The paper provides thorough experimental results demonstrating that the proposed PanFusion framework surpasses existing methods in text-to-panorama generation. It addresses challenges such as visual inconsistency and error propagation commonly observed in previous models like MVDiffusion. The dual-branch approach significantly enhances performance, with PanFusion showing superior results in realism (measured by Fréchet Auto-Encoder Distance) and global scene coherence; the Fréchet computation underlying such metrics is sketched below.
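
For reference, Fréchet-style metrics such as the one cited above compare Gaussian fits of feature distributions from real and generated images. The sketch below shows that standard computation on placeholder features; the actual Fréchet Auto-Encoder Distance extracts features with a panorama autoencoder, which is not reproduced here.

```python
# Fréchet distance between two Gaussians fitted to feature sets:
# d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2)).
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    mu1, mu2 = feats_a.mean(0), feats_b.mean(0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from
        covmean = covmean.real        # numerical error in the matrix sqrt
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))

real = np.random.randn(500, 64)       # placeholder feature vectors
fake = np.random.randn(500, 64) + 0.1
print(frechet_distance(real, fake))   # small but nonzero for shifted features
```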

  • Layout-Conditioned Generation: An impressive aspect of PanFusion is its adaptability to specific layout conditions, making it particularly suitable for applications requiring precise spatial configurations, such as virtual tour scenarios, environmental lighting setups, or AR/VR applications.
  • Potential for Broader Application: Although the current scope focuses on indoor environments, PanFusion's ability to generate high-fidelity panoramic images suggests possible future extensions into diverse sectors like gaming, architecture, and autonomous vehicle systems, where environmental mapping is crucial.

Future Directions

While PanFusion marks a significant step forward, challenges remain. The computational overhead of the dual-branch model calls for further optimization, particularly for real-time applications. Extending the approach to broader scene types, including outdoor and highly dynamic scenes, would further validate its robustness. Future work could also explore more sophisticated semantic control over the generated images, tailoring outputs to specific user needs or domain requirements.

This research contributes a substantial advancement in the field of panoramic imaging from textual descriptions, providing a framework that balances the technical limitations of current models with innovative architectural solutions.