Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Published 24 Oct 2024 in cs.CV | (2410.18830v2)

Abstract: Diffusion models have recently gained recognition for generating diverse, high-quality content, especially in image synthesis. These models excel not only at creating fixed-size images but also at producing panoramic images. However, existing methods often struggle with spatial layout consistency when producing high-resolution panoramas because they lack guidance on the global image layout. This paper introduces Multi-Scale Diffusion (MSD), an optimized framework that extends the panoramic image generation framework to multiple resolution levels. Our method uses gradient descent to incorporate structural information from low-resolution images into high-resolution outputs. Through comprehensive qualitative and quantitative evaluations against prior work, we demonstrate that our approach significantly improves the coherence of high-resolution panorama generation.
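
As a rough illustration of the idea in the abstract (using gradient descent to pull a high-resolution result toward the structure of a low-resolution one), the following Python sketch may help. It is not the authors' implementation: the abstract does not specify the loss, the downsampling operator, or the step size, so the average-pooling `downsample`, the squared-error objective, and the names `guide_step`, `factor`, and `step_size` are all assumptions made for illustration.

```python
import numpy as np


def downsample(x, factor):
    """Average-pool a 2D array by an integer factor (assumed operator)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def guide_step(high, low, factor, step_size):
    """One gradient-descent step on 0.5 * ||downsample(high) - low||^2.

    Hypothetical helper: nudges the high-resolution array so that its
    downsampled version moves toward the low-resolution reference.
    """
    residual = downsample(high, factor) - low
    # Each high-resolution pixel contributes 1 / factor^2 to its block mean,
    # so the gradient is the residual spread over each block, scaled by that weight.
    grad = np.kron(residual, np.ones((factor, factor))) / factor**2
    return high - step_size * grad
```

Under these assumptions, each call shrinks the mismatch between the downsampled high-resolution array and the low-resolution reference by a factor of `1 - step_size / factor**2`, so with `factor=2` and `step_size=1.0` the mismatch falls to 75% of its previous value per step. In a real diffusion pipeline the guidance would act on latents during denoising, which this toy example does not model.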
