Getting it Right: Improving Spatial Consistency in Text-to-Image Models (2404.01197v2)
Abstract: One of the key shortcomings of current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find that spatial relationships are under-represented in the image descriptions found in current vision-language datasets. To alleviate this data bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets, and we show through a 3-fold evaluation and analysis pipeline that SPRIGHT improves the proportion of spatial relationships over existing datasets. We demonstrate the efficacy of SPRIGHT by showing that using only $\sim$0.25% of it yields a 22% improvement in generating spatially accurate images, while also improving FID and CMMD scores. We also find that training on images containing a large number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133 from fine-tuning on fewer than 500 images. Through a set of controlled experiments and ablations, we document additional findings that could support future work that seeks to understand the factors affecting spatial consistency in text-to-image models.
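The core of the SPRIGHT pipeline is re-captioning existing images with a vision-language model prompted to foreground spatial relationships. The sketch below is a minimal illustration under stated assumptions, not the authors' exact pipeline: it assumes a LLaVA-style captioner served through Hugging Face `transformers` (the `llava-hf/llava-1.5-7b-hf` checkpoint is a stand-in), and the spatial prompt is a paraphrase rather than the prompt used in the paper.

```python
# Minimal sketch of spatially focused re-captioning with a LLaVA-style VLM.
# Assumptions (not from the paper): the llava-hf/llava-1.5-7b-hf checkpoint,
# a paraphrased spatial prompt, and default greedy decoding.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # hypothetical stand-in for the paper's captioner
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Paraphrased prompt that steers the caption toward spatial relationships.
SPATIAL_PROMPT = (
    "USER: <image>\nDescribe this image in detail, focusing on the spatial "
    "relationships between objects (left/right, above/below, in front of/behind). "
    "ASSISTANT:"
)

def recaption(image_path: str, max_new_tokens: int = 128) -> str:
    """Generate a spatially focused caption for a single image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=SPATIAL_PROMPT, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the model's answer, dropping the echoed prompt.
    return text.split("ASSISTANT:")[-1].strip()
```

A complementary measurement in the paper is how prevalent spatial language is in a caption corpus. One simple proxy, sketched below, reports the fraction of captions that contain at least one spatial phrase; the phrase list is illustrative, not the paper's exact vocabulary.

```python
import re

# Illustrative spatial-phrase list; the paper's actual vocabulary may differ.
SPATIAL_PHRASES = [
    "left of", "right of", "above", "below", "under", "beneath",
    "on top of", "in front of", "behind", "next to", "beside", "between",
]
_PATTERN = re.compile("|".join(re.escape(p) for p in SPATIAL_PHRASES))

def spatial_proportion(captions: list[str]) -> float:
    """Fraction of captions containing at least one spatial phrase."""
    hits = sum(1 for caption in captions if _PATTERN.search(caption.lower()))
    return hits / max(len(captions), 1)
```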