Getting it Right: Improving Spatial Consistency in Text-to-Image Models (2404.01197v2)

Published 1 Apr 2024 in cs.CV

Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find that spatial relationships are under-represented in the image descriptions found in current vision-language datasets. To alleviate this data bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets and through a 3-fold evaluation and analysis pipeline, show that SPRIGHT improves the proportion of spatial relationships in existing datasets. We show the efficacy of SPRIGHT data by showing that using only $\sim$0.25% of SPRIGHT results in a 22% improvement in generating spatially accurate images while also improving FID and CMMD scores. We also find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Through a set of controlled experiments and ablations, we document additional findings that could support future work that seeks to understand factors that affect spatial consistency in text-to-image models.

Improving Spatial Consistency in Text-to-Image Models through the SPRIGHT Dataset and Efficient Training Techniques

Introduction

Spatial relationships in text prompts present a significant challenge for text-to-image (T2I) synthesis. Despite advances in diffusion models such as Stable Diffusion and DALL-E, generated images often fail to respect the spatial relationships specified in the prompt. To address this, the paper introduces new datasets, training methodologies, and extensive analyses aimed at improving the spatial consistency of images generated from textual descriptions.

Creation of the SPRIGHT Dataset

A core finding of the research is that spatial relationships are under-represented in the image descriptions of existing vision-language datasets. To address this, the authors introduce SPRIGHT, a spatially focused, large-scale dataset created by re-captioning approximately 6 million images from four widely used vision datasets. The new captions emphasize detailed spatial relationships, and a three-fold evaluation and analysis pipeline confirms that SPRIGHT substantially increases the proportion of spatial relationships relative to the original datasets.
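
The paper releases the dataset rather than a re-captioning script here; the following is a minimal sketch of how such spatially focused re-captioning could look with an off-the-shelf vision-language model through Hugging Face transformers. The model checkpoint (LLaVA-1.5) and the prompt wording are illustrative assumptions, not necessarily the authors' exact pipeline.

```python
# Minimal sketch: re-caption images with a spatially focused prompt using an
# off-the-shelf vision-language model. Model choice and prompt wording are
# illustrative assumptions, not the authors' exact re-captioning setup.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed stand-in captioner
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

SPATIAL_PROMPT = (
    "USER: <image>\nDescribe this image in detail, focusing on the spatial "
    "relationships between objects (left/right, above/below, near/far, "
    "in front of/behind). ASSISTANT:"
)

def recaption(image_path: str) -> str:
    """Return a spatially focused caption for a single image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=SPATIAL_PROMPT, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Decode only the generated continuation, dropping the prompt tokens.
    return processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Example usage: caption = recaption("images/000000000139.jpg")
```

Running this loop over the source datasets would yield the kind of spatially enriched captions the paper pairs with the original images.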

Methodological Advancements

The paper demonstrates the efficacy of SPRIGHT through marked improvements in the spatial consistency of generated images. Fine-tuning Stable Diffusion on only about 0.25% of SPRIGHT yields a 22% improvement in generating spatially accurate images, alongside better FID and CMMD scores. The authors also introduce a training strategy that emphasizes images containing a large number of objects; fine-tuning on fewer than 500 such images achieves state-of-the-art spatial consistency, including a spatial score of 0.2133 on T2I-CompBench.
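
To make the object-count strategy concrete, below is a minimal sketch of how one might select a small, object-dense fine-tuning subset from COCO-style instance annotations. The annotation path, the object-count threshold, and the exact selection rule are illustrative assumptions; only the "fewer than 500 images" figure comes from the paper.

```python
# Minimal sketch: pick a small fine-tuning subset of images with many annotated
# objects, following the paper's finding that object-dense images help spatial
# consistency. Paths and the threshold are illustrative assumptions.
import json
from collections import Counter

ANNOTATION_FILE = "annotations/instances_train2017.json"  # assumed COCO layout
MIN_OBJECTS = 18   # illustrative threshold for "many objects"
MAX_IMAGES = 500   # the paper fine-tunes on fewer than 500 images

with open(ANNOTATION_FILE) as f:
    coco = json.load(f)

# Count annotated object instances per image.
objects_per_image = Counter(ann["image_id"] for ann in coco["annotations"])

# Keep the most object-dense images, capped at MAX_IMAGES.
dense_image_ids = [
    image_id
    for image_id, n_objects in objects_per_image.most_common()
    if n_objects >= MIN_OBJECTS
][:MAX_IMAGES]

id_to_filename = {img["id"]: img["file_name"] for img in coco["images"]}
selected_files = [id_to_filename[i] for i in dense_image_ids]
print(f"Selected {len(selected_files)} object-dense training images")
```

The selected images, paired with their SPRIGHT captions, would then form the small fine-tuning set used to improve spatial consistency.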

Diverse Analyses and Findings

Extensive analyses conducted by the researchers shed light on various factors influencing spatial consistency in T2I models. Key findings include:

  • Training Data Composition: Models trained with a balanced mix of spatial and general captions from SPRIGHT demonstrate optimal performance, underscoring the value of diverse training data.
  • Impact of Caption Length: Longer, spatially focused captions lead to better performance, highlighting the importance of detailed descriptions in training data.
  • Exploring the CLIP Text Encoder: Fine-tuning the CLIP text encoder on spatially enriched captions from SPRIGHT yields a stronger semantic understanding of spatial relationships, offering insight into where the improvements arise (a minimal sketch follows this list).
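
As referenced above, here is a minimal sketch of contrastive fine-tuning that adapts only the CLIP text encoder to spatially enriched captions, with the vision tower frozen. The checkpoint, the frozen-vision setup, and the hyperparameters are illustrative assumptions rather than the authors' exact recipe.

```python
# Minimal sketch: one contrastive fine-tuning step of CLIP on (image, spatial
# caption) pairs, with the vision encoder frozen so only the text encoder
# adapts to spatially enriched captions. All settings are illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Freeze the vision encoder; train only the text side (and projections).
for p in model.vision_model.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-6
)

def train_step(images: list, spatial_captions: list) -> float:
    """One contrastive step on a batch of (image, spatial caption) pairs."""
    batch = processor(
        text=spatial_captions, images=images, return_tensors="pt",
        padding=True, truncation=True,
    ).to(device)
    # return_loss=True makes CLIPModel compute the symmetric contrastive loss.
    outputs = model(**batch, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# Example usage:
# from PIL import Image
# imgs = [Image.open("kitchen.jpg").convert("RGB")]
# caps = ["a mug sits to the left of a laptop on a wooden desk"]
# loss = train_step(imgs, caps)
```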

Future Directions and Concluding Remarks

The paper presents a pivotal step towards understanding and improving spatial consistency in text-to-image generation. The SPRIGHT dataset, accompanied by the identified efficient training methodologies, sets a new benchmark in the generation of spatially accurate images. It opens avenues for further research into the intricate dynamics of language and visual representation in AI models.

The open release of the SPRIGHT dataset by the authors not only invites further exploration in this specific area but also supports broader research initiatives aimed at enhancing the capabilities of generative AI in understanding and visualizing spatial relationships.

In conclusion, the paper articulates the complex challenge of spatial consistency in generative AI and presents a solid foundation of resources and insights for future advances in the field. The SPRIGHT dataset, together with the described training techniques and comprehensive analyses, marks a significant milestone on the path towards text-to-image models that faithfully interpret and visualize the spatial nuances embedded in textual descriptions.

Authors
  1. Agneet Chatterjee
  2. Gabriela Ben Melech Stan
  3. Estelle Aflalo
  4. Sayak Paul
  5. Dhruba Ghosh
  6. Tejas Gokhale
  7. Ludwig Schmidt
  8. Hannaneh Hajishirzi
  9. Vasudev Lal
  10. Chitta Baral
  11. Yezhou Yang