Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control (2402.13404v1)
Abstract: While text-to-image diffusion models can generate highquality images from textual descriptions, they generally lack fine-grained control over the visual composition of the generated images. Some recent works tackle this problem by training the model to condition the generation process on additional input describing the desired image layout. Arguably the most popular among such methods, ControlNet, enables a high degree of control over the generated image using various types of conditioning inputs (e.g. segmentation maps). However, it still lacks the ability to take into account localized textual descriptions that indicate which image region is described by which phrase in the prompt. In this work, we show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions using a training-free approach that modifies the crossattention scores during generation. We adapt and investigate several existing cross-attention control methods in the context of ControlNet and identify shortcomings that cause failure (concept bleeding) or image degradation under specific conditions. To address these shortcomings, we develop a novel cross-attention manipulation method in order to maintain image quality while improving control. Qualitative and quantitative experimental studies focusing on challenging cases are presented, demonstrating the effectiveness of the investigated general approach, and showing the improvements obtained by the proposed cross-attention control method.
- A-star: Test-time attention segregation and retention for text-to-image synthesis. arXiv preprint arXiv:2306.14544, 2023.
- Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18370–18380, 2023.
- Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
- ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- Multidiffusion: Fusing diffusion paths for controlled image generation. 2023.
- Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
- Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
- Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
- Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023.
- Sketch2photo: Internet image montage. ACM transactions on graphics (TOG), 28(5):1–10, 2009.
- Zero-shot spatial layout conditioning for text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2174–2183, 2023.
- Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.
- Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
- Localized text-to-image generation for free via cross attention control. arXiv preprint arXiv:2306.14636, 2023.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Semantic photo synthesis. In Computer graphics forum, pages 407–413. Wiley Online Library, 2006.
- Dense text-to-image generation with attention modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7701–7711, 2023.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
- Image synthesis from layout with locality-aware mask adaption. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13819–13828, 2021.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Training-free location-aware text-to-image synthesis. arXiv preprint arXiv:2304.13427, 2023.
- No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing, 21(12):4695–4708, 2012.
- T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.
- Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427, 2023.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICLR), 2021.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Kandinsky: An improved text-to-image synthesis with image prior and latent diffusion. arXiv Preprint, 2023.
- High-resolution image synthesis with latent diffusion models. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022.
- U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- LAION-5B: An open large-scale dataset for training next generation image-text models. Conference on Neural Information Processing Systems (NeurIPS), 2022.
- Ryu Simo. Paint-with-words, implemented with stable diffusion. https://github.com/cloneofsimo/paint-with-words-sd, 2023.
- Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2022.
- Image synthesis from reconfigurable layout and style. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10531–10540, 2019.
- You only need adversarial supervision for semantic image synthesis. arXiv preprint arXiv:2012.04781, 2020.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- R&b: Region and boundary aware zero-shot grounded text-to-image generation. arXiv preprint arXiv:2310.08872, 2023.
- Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.
- Scenecomposer: Any-level semantic image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22468–22478, 2023.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Image generation from layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8584–8593, 2019.
- Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023.
- Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22490–22499, 2023.
- Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
- Sean: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5104–5113, 2020.
- Denis Lukovnikov (11 papers)
- Asja Fischer (63 papers)