Taming Stable Diffusion for Text to 360° Panorama Image Generation (2404.07949v1)
Abstract: Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the Stable Diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.
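The domain gap between panorama and perspective images noted in the abstract is, in part, geometric: a pinhole view and a panorama sample the same scene under different projections. The sketch below is not taken from the PanFusion code. It assumes the panorama is stored in equirectangular format (which the abstract does not state) and shows only the generic pixel correspondence that any projection-aware mechanism would need. The function name `perspective_to_equirect`, its parameters, and the axis conventions are hypothetical choices made for illustration.

```python
import numpy as np


def perspective_to_equirect(fov_deg, yaw_deg, pitch_deg, view_size, pano_hw):
    """Map each pixel of a square pinhole view to (x, y) panorama coordinates.

    Assumed conventions (illustrative, not from the paper):
    camera axes are x right, y down, z forward; positive yaw turns right,
    positive pitch looks up; the panorama is equirectangular with
    longitude 0 at the horizontal centre of the image.
    """
    pano_h, pano_w = pano_hw
    # Focal length in pixels for the requested field of view.
    f = 0.5 * view_size / np.tan(np.radians(fov_deg) / 2.0)
    # Pixel-centre coordinates relative to the principal point.
    c = (np.arange(view_size) + 0.5) - view_size / 2.0
    xs, ys = np.meshgrid(c, c)
    # One unit-length ray per pixel in camera space.
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays into the panorama frame: pitch about x, then yaw about y.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(pitch), -np.sin(pitch)],
                      [0.0, np.sin(pitch), np.cos(pitch)]])
    rot_y = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
    d = dirs @ (rot_y @ rot_x).T
    # Convert rays to longitude/latitude, then to panorama pixel coordinates.
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(-d[..., 1], -1.0, 1.0))
    pano_x = (lon / (2.0 * np.pi) + 0.5) * pano_w
    pano_y = (0.5 - lat / np.pi) * pano_h
    return pano_x, pano_y


# Usage: a 90-degree view looking straight ahead into a 512x1024 panorama.
px, py = perspective_to_equirect(90.0, 0.0, 0.0, 5, (512, 1024))
print(px[2, 2], py[2, 2])  # centre pixel lands at the panorama centre
```

Under these assumptions, straight lines in the perspective view map to curves in the panorama, with stretching that grows toward the poles. That is one concrete way to see why features from the two domains cannot simply be aligned pixel by pixel.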
Authors:
- Cheng Zhang
- Qianyi Wu
- Camilo Cruz Gambardella
- Xiaoshui Huang
- Dinh Phung
- Wanli Ouyang
- Jianfei Cai