
Enhancing Object Coherence in Layout-to-Image Synthesis (2311.10522v6)

Published 17 Nov 2023 in cs.CV and cs.AI

Abstract: Layout-to-image synthesis is an emerging technique in conditional image generation. It aims to generate complex scenes in which users require fine control over the layout of the objects. However, it remains challenging to control object coherence, including semantic coherence (e.g., whether the cat looks at the flowers or not) and physical coherence (e.g., the hand and the racket should not be misaligned). In this paper, we propose a novel diffusion model with effective Global Semantic Fusion (GSF) and self-similarity feature enhancement modules to guide object coherence for this task. For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationships among the objects in an image. Rather than simply employing cross-attention between captions and latent images, which treats the closely related layout restriction and semantic coherence requirement separately and thus leads to the unsatisfying results shown in our experiments, we develop GSF to fuse the supervision from the layout restriction and the semantic coherence requirement and exploit it to guide the image synthesis process. Moreover, to improve physical coherence, we develop a Self-similarity Coherence Attention (SCA) module that explicitly integrates local contextual physical coherence relations into each pixel's generation process. Specifically, we adopt a self-similarity map to encode the physical coherence restrictions and employ it to extract coherent features from the text embedding. Through visualization of our self-similarity map, we explore the essence of SCA, revealing that its effectiveness lies not only in capturing reliable physical coherence patterns but also in enhancing complex texture generation. Extensive experiments demonstrate the superiority of our proposed method in both image generation quality and controllability.
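The abstract does not include pseudocode, but the self-similarity idea behind SCA can be sketched: compute a cosine self-similarity map over flattened pixel features, then use it as attention weights to pool text-conditioned features across similar pixels. The sketch below is only an illustration under that reading; the function names (`self_similarity_map`, `sca_aggregate`) and the toy text projection are hypothetical and not from the paper.

```python
import numpy as np

def self_similarity_map(feats):
    # feats: (N, C) pixel features flattened from an (H, W, C) map.
    # Cosine self-similarity: row i encodes how pixel i relates to
    # every other pixel -- a proxy for local coherence structure.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T  # (N, N), diagonal is 1

def sca_aggregate(feats, text_emb):
    # Hypothetical sketch of the SCA aggregation step: use the
    # self-similarity map as (softmax-normalized) attention weights
    # over per-pixel text-conditioned features, so each pixel pools
    # context from pixels that are similar to it.
    sim = self_similarity_map(feats)                            # (N, N)
    weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    pixel_text = feats @ text_emb                               # (N, D) toy projection
    return weights @ pixel_text                                 # coherence-aware features
```

In the actual model the similarity would be computed on intermediate diffusion features and fused into cross-attention; this NumPy version only shows the shape of the computation.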

Authors (5)
  1. Yibin Wang
  2. Weizhong Zhang
  3. Cheng Jin
  4. Honghui Xu
  5. Changhai Zhou
Citations (1)