
SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model (2403.10044v1)

Published 15 Mar 2024 in cs.CV

Abstract: Controllable spherical panoramic image generation holds substantial potential for applications across a variety of domains. However, it remains a challenging task because of the inherent spherical distortion and geometry characteristics, which result in low-quality content generation. In this paper, we introduce a novel framework, SphereDiffusion, to address these unique challenges and generate high-quality, precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to make better use of the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and reduces FID by around 35% on average in relative terms.
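
The rotation-invariance and boundary-continuity ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it only assumes the panorama is stored in equirectangular form, where a rotation about the vertical axis is a circular shift of the image columns and the last and first columns are neighbours on the sphere. The function names `yaw_rotate_equirect` and `seam_gap` are hypothetical, chosen for this example.

```python
def yaw_rotate_equirect(image, degrees):
    """Rotate an equirectangular panorama about the vertical axis.

    image:   list of rows, each row a list of pixel values (H x W).
    degrees: yaw angle; positive values shift content to the right.
    Returns a new image; the input is left unchanged.
    """
    width = len(image[0])
    # Each column spans 360 / width degrees of longitude, so a yaw
    # rotation becomes a circular shift by a whole number of columns.
    shift = round(degrees / 360.0 * width) % width
    if shift == 0:
        return [list(row) for row in image]
    return [row[-shift:] + row[:-shift] for row in image]


def seam_gap(image):
    """Mean absolute difference between the last and first columns.

    These two columns are adjacent on the sphere, so a large value
    indicates a visible seam at the left/right image boundary.
    """
    return sum(abs(row[-1] - row[0]) for row in image) / len(image)


# Toy 2 x 8 "panorama" with scalar pixel values (illustrative data only).
pano = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [10, 11, 12, 13, 14, 15, 16, 17],
]

rotated = yaw_rotate_equirect(pano, 90)   # 90 degrees -> shift by 2 columns
restored = yaw_rotate_equirect(rotated, -90)
print(rotated[0])
print(restored == pano)
print(seam_gap(pano), seam_gap(rotated))
```

Because the shift wraps around, every rotated copy depicts the same scene, which is one simple way such rotations could diversify training data; the seam measure shows why a generator that ignores the wrap-around can produce a discontinuity at the image boundary.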

