Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive (2401.08815v1)
Abstract: Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout, which limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce a multistep unrolling strategy: instead of looking at a single timestep, we unroll a few denoising steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of the denoised images with the layout over a certain time window. Our experiments show that ALDM yields images that faithfully adhere to the input layout, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target-distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).
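The core training idea described above — a discriminator scoring pixel-level layout alignment, averaged over a few unrolled denoising steps — can be illustrated with a deliberately toy sketch. Everything here (`denoise_step`, `discriminator_alignment`, the loss shape) is illustrative and hypothetical, not the paper's released code; real ALDM operates on images and a segmentation network, not scalar lists.

```python
# Hypothetical sketch of ALDM-style adversarial supervision with
# multistep unrolling. All names and signatures are illustrative;
# "pixels" are plain floats so the mechanics stay visible.

def denoise_step(x, t, layout):
    """Toy one-step denoiser: nudges x halfway toward the layout target."""
    return [xi + 0.5 * (li - xi) for xi, li in zip(x, layout)]

def discriminator_alignment(x, layout):
    """Toy stand-in for a segmentation-based discriminator: returns a
    pixel-level alignment score in [0, 1] (1 = perfect layout match)."""
    err = sum(abs(xi - li) for xi, li in zip(x, layout)) / len(x)
    return max(0.0, 1.0 - err)

def adversarial_unrolled_loss(x_t, layout, timesteps, unroll_steps=3):
    """Unroll a few denoising steps (imitating inference) and average the
    discriminator's alignment feedback over that window."""
    x = x_t
    scores = []
    for t in timesteps[:unroll_steps]:
        x = denoise_step(x, t, layout)
        scores.append(discriminator_alignment(x, layout))
    # Generator objective: encourage high alignment at every unrolled step.
    return 1.0 - sum(scores) / len(scores)

layout = [1.0, 0.0, 1.0, 1.0]   # target "segmentation" values
noisy = [0.2, 0.9, 0.4, 0.1]    # current noisy sample
loss = adversarial_unrolled_loss(noisy, layout, timesteps=[10, 9, 8])
print(round(loss, 4))  # → 0.2333
```

The key design point mirrored here is that the loss penalizes misalignment at every step of the unrolled window, not only at the final denoised sample, which is what pushes the generator toward consistent layout adherence across the sampling trajectory.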