Taming Stable Diffusion for Text to 360° Panorama Image Generation (2404.07949v1)

Published 11 Apr 2024 in cs.CV

Abstract: Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.

References (67)
  1. Diverse plausible 360-degree image outpainting for efficient 3DCG background creation. In CVPR, pages 11441–11450, 2022.
  2. MultiDiffusion: Fusing diffusion paths for controlled image generation. In ICML, pages 1737–1752. PMLR, 2023.
  3. Matterport3D: Learning from RGB-D data in indoor environments. 3DV, 2017.
  4. Text2Light: Zero-shot text-driven HDR panorama generation. ACM TOG, 41(6):1–16, 2022.
  5. Guided co-modulated GAN for 360° field of view extrapolation. In 3DV, pages 475–485. IEEE, 2022.
  6. Diffusion models beat GANs on image synthesis. In NeurIPS, pages 8780–8794, 2021.
  7. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021.
  8. Ctrl-Room: Controllable text-to-3D room meshes generation with layout constraints. arXiv preprint arXiv:2310.03602, 2023.
  9. SceneScape: Text-driven consistent scene generation. arXiv preprint arXiv:2302.01133, 2023.
  10. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  11. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.
  12. Denoising diffusion probabilistic models. In NeurIPS, pages 6840–6851, 2020.
  13. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In ICCV, pages 7909–7920, 2023.
  14. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  15. Elucidating the design space of diffusion-based generative models. In NeurIPS, pages 26565–26577, 2022.
  16. SyncDiffusion: Coherent montage via synchronized joint diffusions. In NeurIPS, 2023.
  17. PanoGen: Text-conditioned panoramic environment generation for vision-and-language navigation. In NeurIPS, 2023.
  18. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
  19. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, pages 5775–5787, 2022.
  20. Autoregressive omni-aware outpainting for open-vocabulary 360-degree image generation. arXiv preprint arXiv:2309.03467, 2023.
  21. RePaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
  22. Guided image synthesis via initial image editing in diffusion model. In ACM MM, pages 5321–5329. ACM, 2023.
  23. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421. Springer, 2020.
  24. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  25. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pages 16784–16804. PMLR, 2022.
  26. BIPS: Bi-modal indoor panorama synthesis via residual depth-aided adversarial learning. In ECCV, pages 352–371. Springer, 2022.
  27. High-resolution depth estimation for 360° panoramas through perspective and panoramic depth images registration. In WACV, pages 3116–3125, 2023.
  28. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  29. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  30. 360MonoDepth: High-resolution 360° monocular depth estimation. In CVPR, pages 3762–3772, 2022.
  31. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  32. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  33. DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
  34. RunwayML. Stable Diffusion. https://github.com/runwayml/stable-diffusion, 2021.
  35. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022.
  36. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, pages 36479–36494, 2022.
  37. Improved techniques for training GANs. In NeurIPS, pages 2226–2234, 2016.
  38. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  39. LAION-5B: an open large-scale dataset for training next generation image-text models. In NeurIPS, pages 25278–25294, 2022.
  40. Conditional 360-degree image synthesis for immersive indoor scene decoration. In ICCV, pages 4478–4488, 2023.
  41. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  42. Denoising diffusion implicit models. In ICLR, 2021.
  43. RoomDreamer: Text-driven 3D indoor scene synthesis with coherent geometry and texture. In ACM MM, pages 6898–6906. ACM, 2023.
  44. Generative modeling by estimating gradients of the data distribution. In NeurIPS, pages 11895–11907, 2019.
  45. HorizonNet: Learning room layout with 1D representation and pano stretch data augmentation. In CVPR, pages 1047–1056, 2019.
  46. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  47. MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. In NeurIPS, 2023.
  48. Consistent view synthesis with pose-guided diffusion models. In CVPR, pages 16773–16783, 2023.
  49. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  50. LayoutMP3D: Layout annotation of Matterport3D. arXiv preprint arXiv:2003.13516, 2020.
  51. StyleLight: HDR panorama generation for lighting estimation and editing. In ECCV, pages 477–492. Springer, 2022.
  52. PERF: Panoramic neural radiance field from a single panorama. arXiv preprint arXiv:2310.16831, 2023.
  53. Customizing 360-degree panoramas through text-to-image diffusion models. In WACV, 2024.
  54. 360-degree panorama generation from few unregistered NFoV images. In ACM MM, pages 6811–6821. ACM, 2023.
  55. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In NeurIPS, 2023.
  56. IPO-LDM: Depth-aided 360-degree indoor RGB panorama outpainting via latent diffusion model. arXiv preprint arXiv:2307.03177, 2023.
  57. Layout-guided novel view synthesis from a single indoor panorama. In CVPR, pages 16438–16447, 2021.
  58. State-of-the-art in 360 video/image processing: Perception, assessment and compression. JSTSP, 14(1):5–26, 2020.
  59. A survey of scene understanding by event reasoning in autonomous driving. MIR, 15(3):249–266, 2018.
  60. Neural rendering in a room: Amodal 3D understanding and free-viewpoint rendering for the closed scene composed of pre-captured objects. ACM TOG, 41(4):1–10, 2022.
  61. DreamSpace: Dreaming your room space with text-driven panoramic texture propagation. arXiv preprint arXiv:2310.13119, 2023.
  62. Diffusion models: A comprehensive survey of methods and applications. ACM CSUR, 2022.
  63. Long-term photometric consistent novel view synthesis with diffusion models. In ICCV, 2023.
  64. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.
  65. DiffCollage: Parallel generation of large content with diffusion models. In CVPR, 2023.
  66. ACDNet: Adaptively combined dilated convolution for monocular panorama depth estimation. In AAAI, pages 3653–3661, 2022.
  67. Manhattan room layout reconstruction from a single 360 image: A comparative study of state-of-the-art methods. IJCV, 129:1410–1431, 2021.
Authors (7)
  1. Cheng Zhang (389 papers)
  2. Qianyi Wu (29 papers)
  3. Camilo Cruz Gambardella (9 papers)
  4. Xiaoshui Huang (55 papers)
  5. Dinh Phung (148 papers)
  6. Wanli Ouyang (359 papers)
  7. Jianfei Cai (163 papers)
Citations (4)

Summary

  • The paper introduces PanFusion, a dual-branch diffusion model that overcomes data scarcity and geometric gaps to generate coherent 360° panoramic images from text.
  • It employs a novel Equirectangular-Perspective Projection Attention mechanism that fuses global panoramic and local perspective views for enhanced visual realism.
  • Robust training on minimal paired data, aided by LoRA fine-tuning, yields significant gains on performance metrics, with promising applications in AR/VR, architectural visualization, and beyond.

A Review and Analysis of "Taming Stable Diffusion for Text to 360° Panorama Image Generation"

The paper "Taming Stable Diffusion for Text to 360° Panorama Image Generation" by Cheng Zhang et al. explores a novel approach for generating 360-degree panoramic images from textual prompts, an area that presents significant computational challenges due to data scarcity and the complex geometric transformations involved. The paper introduces a dual-branch diffusion model named PanFusion to improve the quality and consistency of panoramic image generation using the capabilities of generative models like Stable Diffusion.

Key Contributions

The paper identifies two primary obstacles in generating panoramic images: limited paired data for text-to-panorama generation and the geometric domain gap between panoramic and traditional perspective images. To address these, the authors propose a dual-branch architecture consisting of a global panorama branch and a local perspective branch. This setup aims to exploit the rich detail of perspective imagery while maintaining the global coherence necessary for panoramas.

  1. Dual-Branch Diffusion Architecture: PanFusion integrates two branches, each fine-tuned for its specific strengths; one maintains the panoramic "canvas" while the other focuses on multiview perspective images. This division allows the model to harness Stable Diffusion's prior knowledge in the perspective domain and adapt it for panoramic generation.
  2. Equirectangular-Perspective Projection Attention (EPPA): This novel attention mechanism maintains geometric integrity while coordinating the global and local branches. It accounts for the equirectangular geometry of panoramic imagery when exchanging information between branches, ensuring consistency across different views (the projection mapping at its core is sketched after this list).
  3. Robust Training and Fine-Tuning: By adapting pretrained models with minimal data, using LoRA for parameter-efficient fine-tuning, the paper demonstrates an approach that conservatively extends existing models to data-scarce conditions (a minimal LoRA layer is also sketched below).
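
To make the cross-branch correspondence concrete, the sketch below computes the equirectangular coordinates that each pixel of a perspective view projects to. A projection-aware attention mechanism such as EPPA can use exactly this kind of mapping to align perspective tokens with panorama tokens; the code is an illustrative reconstruction under that assumption, not the authors' implementation, and all names are ours.

```python
# Map each pixel of a perspective view to (u, v) coordinates on an
# equirectangular panorama. Hypothetical helper for illustration only.
import numpy as np

def perspective_to_equirect(h, w, fov_deg, yaw_deg, pitch_deg):
    """Return (u, v) in [0, 1) on the panorama for every pixel of an
    h x w perspective view with the given field of view and orientation."""
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)      # focal length (pixels)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Camera-space ray directions: x right, y down, z forward.
    dirs = np.stack([(xs - w / 2) / f, (ys - h / 2) / f,
                     np.ones((h, w))], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (about x), then yaw (about y).
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p), np.cos(p)]])
    ry = np.array([[np.cos(t), 0, np.sin(t)],
                   [0, 1, 0],
                   [-np.sin(t), 0, np.cos(t)]])
    dirs = dirs @ rx.T @ ry.T

    lon = np.arctan2(dirs[..., 0], dirs[..., 2])       # longitude in [-pi, pi)
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))      # latitude in [-pi/2, pi/2]
    return lon / (2 * np.pi) + 0.5, lat / np.pi + 0.5

u, v = perspective_to_equirect(64, 64, fov_deg=90, yaw_deg=45, pitch_deg=0)
print(u.shape, float(u.min()), float(u.max()))         # panorama span covered
```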
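
For the fine-tuning contribution, the following is a minimal, generic LoRA layer in PyTorch. It is a sketch of the standard technique, not the paper's code: the rank, scaling, and the decision to wrap a bare nn.Linear are illustrative assumptions, and in practice the adapters would be injected into the diffusion branches' attention projections.

```python
# Generic LoRA adapter: freeze the pretrained weight W and learn a
# low-rank update B @ A, training only r * (d_in + d_out) parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pretrained path frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r             # common LoRA scaling convention

    def forward(self, x):
        # Frozen output plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(320, 320))
print(layer(torch.randn(2, 77, 320)).shape)  # torch.Size([2, 77, 320])
```

Because lora_b starts at zero, the adapted layer initially reproduces the pretrained model exactly, which is what makes this kind of conservative extension safe under data-scarce conditions.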

Experimental Evaluation and Implications

The paper provides thorough experimental results demonstrating that the proposed PanFusion framework surpasses existing methods in text-to-panorama generation. It addresses challenges such as visual inconsistency and error propagation commonly observed in previous models like MVDiffusion. The dual-branch approach significantly enhances performance, with PanFusion showing superior results in realism (measured by Fréchet Auto-Encoder Distance) and global scene coherence; the Fréchet computation underlying such metrics is sketched below.
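
For reference, Fréchet-style metrics such as the one cited above compare Gaussian fits of feature distributions from real and generated images. The sketch below shows that standard computation on placeholder features; the actual Fréchet Auto-Encoder Distance extracts features with a panorama autoencoder, which is not reproduced here.

```python
# Fréchet distance between two Gaussians fitted to feature sets:
# d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2)).
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    mu1, mu2 = feats_a.mean(0), feats_b.mean(0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from
        covmean = covmean.real        # numerical error in the matrix sqrt
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))

real = np.random.randn(500, 64)       # placeholder feature vectors
fake = np.random.randn(500, 64) + 0.1
print(frechet_distance(real, fake))   # small but nonzero for shifted features
```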

  • Layout-Conditioned Generation: An impressive aspect of PanFusion is its adaptability to specific layout conditions, making it particularly suitable for applications requiring precise spatial configurations, such as virtual tour scenarios, environmental lighting setups, or AR/VR applications.
  • Potential for Broader Application: Although the current scope focuses on indoor environments, PanFusion's ability to generate high-fidelity panoramic images suggests possible future extensions into diverse sectors like gaming, architecture, and autonomous vehicle systems, where environmental mapping is crucial.

Future Directions

While PanFusion marks a significant step forward, challenges remain. The computational overhead of the dual-branch model calls for further optimization, particularly for real-time applications. Extending the approach to broader scene types, including outdoor and highly dynamic scenes, would further validate its robustness. Future work could also explore more sophisticated semantic control over the generated images, tailoring outputs to specific user needs or domain requirements.

This research contributes a substantial advancement in the field of panoramic imaging from textual descriptions, providing a framework that balances the technical limitations of current models with innovative architectural solutions.