Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive (2401.08815v1)

Published 16 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).

Summary

  • The paper presents a novel adversarial supervision method with multistep unrolling to improve alignment in layout-to-image synthesis.
  • It employs a segmentation-based discriminator that provides pixel-level feedback, encouraging close adherence to the input layout.
  • By synthesizing target-domain samples via text control, the approach improves domain generalization of semantic segmentation models by roughly 12 mIoU points.

Introduction

In the expanding field of image synthesis, the layout-to-image (L2I) task poses unique challenges. L2I involves generating images that precisely correspond to a given semantic layout, essentially a per-pixel map of semantic labels such as 'sky', 'tree', or 'building'. Generating images that align closely with such detailed input while remaining editable through text prompts has proven difficult, and most current models struggle to balance this specificity with flexibility.
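To make the input concrete, the sketch below shows one common way to represent such a layout in code: an integer label map with one class id per pixel, optionally expanded to a one-hot tensor before being fed to a model. The class ids, regions, and resolution are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# A semantic layout is an integer label map: one class id per pixel.
# Class ids (0=sky, 1=tree, 2=building) and the 512x512 size are illustrative.
layout = torch.zeros(512, 512, dtype=torch.long)
layout[:200, :] = 0            # sky across the top
layout[200:, :256] = 1         # trees on the lower left
layout[200:, 256:] = 2         # a building on the lower right

# Models typically consume the layout as a one-hot tensor of shape (C, H, W).
num_classes = 3
layout_onehot = F.one_hot(layout, num_classes).permute(2, 0, 1).float()
print(layout_onehot.shape)     # torch.Size([3, 512, 512])
```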

Advancing L2I Synthesis

The authors introduce ALDM, an approach that integrates adversarial supervision into the conventional training pipeline of L2I diffusion models. A segmentation-based discriminator provides pixel-level feedback to the diffusion generator, which is crucial for aligning the denoised images accurately with the input layout. The key addition is a multistep unrolling strategy that imitates several inference steps during training, so that the generated images stay faithful to the layout across the denoising process.
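The adversarial supervision can be pictured as a per-pixel classification loss on the denoised image. The snippet below is a minimal sketch in the spirit of OASIS-style segmentation discriminators, assuming a discriminator that outputs per-pixel logits over N semantic classes plus one extra 'fake' class; it is not the authors' exact formulation.

```python
import torch.nn.functional as F

def layout_adversarial_loss(discriminator, denoised_image, layout):
    """Generator-side adversarial supervision, sketched after OASIS-style
    segmentation discriminators. `discriminator` is assumed to return
    per-pixel logits over N semantic classes plus one extra 'fake' class;
    `layout` holds the ground-truth class id of every pixel.
    """
    logits = discriminator(denoised_image)    # (B, N + 1, H, W)
    # The generator lowers this loss only when each denoised pixel is
    # classified as its layout class rather than as 'fake', which is the
    # pixel-level alignment feedback described above.
    return F.cross_entropy(logits, layout)    # layout: (B, H, W), long
```

Because the target at every pixel is the layout's class id, the generator cannot reduce this loss without producing content that matches the layout spatially.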

The Blend of Adversarial Supervision and Unrolling

Adversarial supervision gives the model direct, explicit feedback on adherence to the input layout: the segmentation-based discriminator acts as a quality check that pushes the generator to improve its alignment. To enforce this consistency across the denoising steps, the multistep unrolling strategy asks the discriminator to assess not a single generation step but a sequence of them, mirroring the actual sampling process. As a result, the generated images comply closely with the layout, reflected in substantial gains in mean intersection-over-union (mIoU) when the method is evaluated.
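Multistep unrolling can be sketched as a short loop over denoising steps that reuses the layout-adversarial loss from the previous snippet. The UNet and scheduler interfaces below, and the simplified timestep bookkeeping, are assumptions loosely modeled on diffusers-style APIs, not the paper's implementation.

```python
def unrolled_adversarial_loss(unet, scheduler, discriminator,
                              x_t, timestep, layout, text_cond, k_steps=3):
    """A minimal sketch of multistep unrolling: starting from the current
    noisy sample x_t, simulate a few denoising steps and apply the
    layout-adversarial loss (previous snippet) to each intermediate
    clean-image estimate. Interfaces and timestep handling are assumed.
    """
    total = 0.0
    x, t = x_t, timestep
    for _ in range(k_steps):
        eps = unet(x, t, encoder_hidden_states=text_cond).sample
        step = scheduler.step(eps, t, x)
        x = step.prev_sample                    # the next, less noisy sample
        x0_pred = step.pred_original_sample     # current estimate of the clean image
        total = total + layout_adversarial_loss(discriminator, x0_pred, layout)
        t = t - 1                               # simplified timestep schedule
    return total / k_steps
```

Averaging the loss over the unrolled window means the discriminator judges a sequence of intermediate estimates rather than a single step, mirroring how the layout must be respected throughout actual sampling.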

Practical Implications and Achievements

The practical implications of this approach are significant. The authors showcase ALDM's utility by synthesizing diverse images guided by text prompts and using them to improve domain generalization in semantic segmentation models. The gains are considerable, about 12 mIoU points, a substantial step forward in a model's ability to handle previously unseen data. This matters for real-world scenarios in which models must perform accurately across varied and often unpredictable conditions.
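In code, the augmentation idea reduces to occasionally swapping a real training pair for one synthesized from the same label map under a target-domain prompt. The helper below is a hypothetical sketch; `aldm_synthesize`, the prompt list, and the mixing probability are placeholders, not the paper's interface.

```python
import random

def augmented_pair(real_loader, aldm_synthesize, target_prompts, p_syn=0.5):
    """A hedged sketch of the augmentation idea: with probability p_syn, swap
    the real image for one synthesized from the *same* label map under a
    target-domain prompt, so the segmentation labels stay valid while the
    appearance shifts toward unseen domains.
    """
    image, label_map = next(iter(real_loader))
    if random.random() < p_syn:
        prompt = random.choice(target_prompts)      # e.g. "a snowy street scene"
        image = aldm_synthesize(label_map, prompt)  # labels are reused unchanged
    return image, label_map                         # train the segmenter on this pair
```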

Conclusion

ALDM makes a compelling case for adversarial supervision and multistep unrolling in image synthesis. The combination of faithful adherence to layout conditions and broad text editability could shape future techniques in AI-driven image synthesis, particularly in fields that depend on high-precision visual data such as autonomous driving and advanced image editing. The method melds detail-oriented generation with creative flexibility, paving the way for more versatile and reliable synthetic image production.