Zero-shot spatial layout conditioning for text-to-image diffusion models (2306.13754v1)
Abstract: Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and offer an intuitive and powerful user interface to drive the image generation process. Expressing spatial constraints, e.g. to position specific objects in particular locations, is cumbersome using text, and current text-based image generation models are not able to accurately follow such instructions. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models and does not require any additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align the generation with input masks. Our experimental results combine high image quality with accurate alignment of generated content to input segmentations, and improve over prior work both quantitatively and qualitatively, including over methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state of the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset at similar FID scores.
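The mechanism the abstract describes, extracting implicit segmentation maps from cross-attention layers and steering the denoising process so they align with the input masks, can be illustrated with a short gradient-guidance sketch. Everything below is a minimal illustration under stated assumptions: the tensor shapes, the form of `segmentation_loss`, and the `unet_eps`/`get_attn` interfaces are hypothetical stand-ins, not the paper's exact formulation.

```python
# Minimal sketch of zero-shot segmentation guidance via cross-attention.
# Hypothetical interfaces and loss; not the paper's exact formulation.
import torch


def segmentation_loss(attn_maps: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Penalise attention mass that falls outside each segment's input mask.

    attn_maps: (num_segments, H, W) cross-attention over image positions,
               aggregated for the text tokens describing each segment.
    masks:     (num_segments, H, W) binary input masks on the same grid.
    """
    a = attn_maps.flatten(1)
    a = a / (a.sum(dim=1, keepdim=True) + 1e-8)  # compare layout, not magnitude
    inside = (a * masks.flatten(1)).sum(dim=1)   # attention fraction inside mask
    return (1.0 - inside).mean()


def guided_eps(latents: torch.Tensor, t: int, unet_eps, get_attn,
               masks: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Classifier-guidance-style noise estimate for one denoising step.

    unet_eps(latents, t) -> predicted noise        (assumed interface)
    get_attn(latents, t) -> (num_segments, H, W)   cross-attention maps
    """
    latents = latents.detach().requires_grad_(True)
    grad = torch.autograd.grad(
        segmentation_loss(get_attn(latents, t), masks), latents)[0]
    # Shift the noise estimate along the loss gradient; the exact update
    # rule and scaling schedule are assumptions of this sketch.
    return unet_eps(latents.detach(), t) + scale * grad


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end without a real model.
    H = W = 16
    latents = torch.randn(1, 4, H, W)
    masks = torch.zeros(2, H, W)
    masks[0, :8] = 1.0   # first segment: top half of the canvas
    masks[1, 8:] = 1.0   # second segment: bottom half
    unet_eps = lambda z, t: torch.zeros_like(z)
    get_attn = lambda z, t: (torch.softmax(z.mean(1).flatten(1), dim=-1)
                             .reshape(1, H, W).expand(2, -1, -1))
    print(guided_eps(latents, t=0, unet_eps=unet_eps,
                     get_attn=get_attn, masks=masks).shape)
```

Because the loss is computed only from quantities the pretrained model already produces, namely its cross-attention maps, no fine-tuning is needed, which is what makes this style of guidance zero-shot.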
- SpaText: Spatio-textual representation for controllable image generation. arXiv preprint, arXiv:2211.14305, 2022.
- eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint, arXiv:2211.01324, 2022.
- Universal guidance for diffusion models. arXiv preprint, arXiv:2302.07121, 2023.
- MultiDiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint, arXiv:2302.08113, 2023.
- COCO-Stuff: Thing and stuff classes in context. In CVPR, 2018.
- MaskGIT: Masked generative image transformer. In CVPR, 2022.
- Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint, arXiv:2301.13826, 2023.
- Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 40(4):834–848, 2018.
- Vision transformer adapter for dense predictions. In ICLR, 2023.
- DiffEdit: Diffusion-based semantic image editing with mask generation. In ICLR, 2023.
- Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
- Taming transformers for high-resolution image synthesis. In CVPR, 2021.
- Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint, arXiv:2212.05032, 2022.
- Make-A-Scene: Scene-based text-to-image generation with human priors. In ECCV, 2022.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint, arXiv:2208.01626, 2022.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
- Denoising diffusion probabilistic models. In NeurIPS, 2020.
- Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Improved masked image generation with token-critic. In ECCV, 2022.
- GLIGEN: Open-set grounded text-to-image generation. arXiv preprint, arXiv:2301.07093, 2023.
- Image segmentation using text and image prompts. In CVPR, 2022.
- Directed diffusion: Direct control of object placement through attention guidance. arXiv preprint, arXiv:2302.13153, 2023.
- SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
- T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint, arXiv:2302.08453, 2023.
- GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
- Neural discrete representation learning. In NeurIPS, 2017.
- Controllable image generation via collage representations. ICLR submission, 2022.
- Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
- Zero-shot image-to-image translation. arXiv preprint, arXiv:2302.03027, 2023.
- On aliased resizing and surprising subtleties in GAN evaluation. In CVPR, 2022.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21, 2020.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint, arXiv:2204.06125, 2022.
- Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, 2019.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Palette: Image-to-image diffusion models. In ACM SIGGRAPH, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- You only need adversarial supervision for semantic image synthesis. In ICLR, 2021.
- Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
- Denoising diffusion implicit models. In ICLR, 2021.
- Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
- Attention is all you need. In NeurIPS, 2017.
- Pretraining is all you need for image-to-image translation. arXiv preprint, arXiv:2205.12952, 2022.
- Semantic image synthesis via diffusion models. arXiv preprint, arXiv:2207.00050, 2022.
- Adding conditional control to text-to-image diffusion models. arXiv preprint, arXiv:2302.05543, 2023.
- Guillaume Couairon
- Marlène Careil
- Matthieu Cord
- Stéphane Lathuilière
- Jakob Verbeek