Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation
Abstract: Diffusion models have recently gained recognition for generating diverse, high-quality content, especially in image synthesis. These models excel not only at creating fixed-size images but also at producing panoramic images. However, existing methods often struggle to keep the spatial layout consistent when producing high-resolution panoramas, because they lack guidance on the global image layout. This paper introduces Multi-Scale Diffusion (MSD), an optimized framework that extends the existing panoramic image generation framework to multiple resolution levels. Our method uses gradient descent to incorporate structural information from low-resolution images into high-resolution outputs. Through comprehensive qualitative and quantitative evaluations against prior work, we demonstrate that our approach significantly improves the coherence of high-resolution panorama generation.