
DOCCI: Descriptions of Connected and Contrasting Images (2404.19753v1)

Published 30 Apr 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.


Summary

  • The paper introduces DOCCI, a novel dataset of 15,000 images with extensive annotations that capture subtle visual differences.
  • It employs detailed captions focusing on spatial relations, counting, and text rendering to address weaknesses in current models.
  • A PaLI 5B model fine-tuned on DOCCI matches or outperforms larger models such as LLaVA-1.5 7B and InstructBLIP 7B, pointing future research toward more robust image-text understanding and evaluation.

Exploring DOCCI: A New Dataset for Enhanced Image and Text Understanding

Introduction to DOCCI

Descriptions of Connected and Contrasting Images (DOCCI) is a novel dataset designed to improve the training and evaluation of image-to-text (I2T) and text-to-image (T2I) models. DOCCI includes 15,000 images, each paired with a detailed, human-annotated description, averaging 136 words in length. These descriptions are not only lengthy but are crafted to clearly differentiate each image from related ones, covering multiple challenges such as spatial relations, counting, and text rendering.
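For readers who want to explore the data directly, the minimal sketch below loads DOCCI and checks the reported description statistics. It assumes the dataset is mirrored on the Hugging Face Hub under the id google/docci with description and image fields; both identifiers are assumptions, so verify them against the official release.

```python
# Minimal sketch: load DOCCI and verify description lengths.
# The Hub id "google/docci" and the field names "description"/"image"
# are assumptions -- check the official DOCCI release for the exact names.
from datasets import load_dataset

ds = load_dataset("google/docci", split="train")

# The paper reports descriptions averaging ~136 words.
lengths = [len(ex["description"].split()) for ex in ds]
print(f"{len(ds)} examples, avg {sum(lengths) / len(lengths):.1f} words")

# Inspect one image/description pair.
example = ds[0]
print(example["description"][:300])
example["image"].show()  # PIL image
```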

What Sets DOCCI Apart?

  • Detailed Annotations: Each image in DOCCI is accompanied by a long, descriptive caption, carefully annotated to emphasize fine details and subtle distinctions between similar images.
  • Purposeful Image Selection: The 15,000 images were taken, curated, and donated by a single researcher specifically to probe known model weaknesses, including complex visual reasoning, counting, and text rendering.
  • Strong Quantitative Results: A PaLI 5B model fine-tuned on DOCCI produces descriptions that match or exceed those of larger models such as LLaVA-1.5 7B and InstructBLIP 7B trained on less detailed data (a baseline captioning sketch follows this list).
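Since PaLI checkpoints are not generally available, a rough way to see why DOCCI's references are demanding is to caption one of its images with an off-the-shelf open model and compare the output against the ~136-word human reference. The BLIP checkpoint below is an illustrative stand-in, not a model from the paper, and the dataset id is again an assumption.

```python
# Sketch: caption a DOCCI image with an open baseline (BLIP) and compare
# against the long human reference. BLIP stands in for PaLI, which is not
# publicly released; the dataset id is likewise an assumption.
from datasets import load_dataset
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

example = load_dataset("google/docci", split="test")[0]
inputs = processor(images=example["image"], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=80)

print("baseline:", processor.decode(out[0], skip_special_tokens=True))
print("reference:", example["description"][:300], "...")
```

Typical baseline captions run a sentence or two, which makes the gap to DOCCI's multi-sentence references easy to see.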

Implications and Applications

  • Model Training: The rich annotations and challenging image scenarios make DOCCI an excellent resource for training more robust I2T models.
  • Model Evaluation: DOCCI's detailed annotations also provide a rigorous testbed for measuring how faithfully T2I models render long descriptions and fine-grained details (see the sketch after this list).
  • Research Development: By highlighting specific weaknesses of current models, DOCCI directs future research towards improving model handling of complex visual tasks and detailed text generation.
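As one concrete way to run such a stress test, the sketch below feeds a full-length DOCCI description to an off-the-shelf diffusion model. The Stable Diffusion checkpoint and the dataset id are illustrative assumptions; the paper itself evaluates other T2I systems.

```python
# Illustrative T2I stress test: generate from a full-length DOCCI description.
# The model id and dataset id are assumptions chosen for illustration.
import torch
from datasets import load_dataset
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = load_dataset("google/docci", split="test")[0]["description"]

# CLIP-based text encoders cap input at 77 tokens, so most of a ~136-word
# description is silently dropped -- one of the failure modes DOCCI exposes.
image = pipe(prompt).images[0]
image.save("docci_t2i_sample.png")
```

Comparing the output against the description attribute by attribute shows which objects, counts, and spatial relations survive the truncation.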

Potential for Future Research

The introduction of DOCCI stimulates several potential research directions:

  • Improving Model Architectures: The dataset’s complexity can motivate enhancements to neural network architectures for handling detailed visual-textual understanding.
  • Advancing Evaluation Metrics: Automatic metrics such as FID and CLIPScore have known limitations on long, detailed descriptions (one is demonstrated in the sketch after this list); DOCCI motivates the development of new, more reliable evaluation metrics for detailed image and text generation.
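To make one of those limitations concrete, the sketch below scores an image against a short and a long caption with torchmetrics' CLIPScore. Because CLIP's text encoder accepts at most 77 tokens, anything past that point in a DOCCI-length description cannot move the score; the random image tensor is purely a stand-in.

```python
# Sketch of a CLIPScore blind spot: CLIP truncates text at 77 tokens,
# so detail beyond that point in a long description is simply ignored.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
image = torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8)  # stand-in image

short = "A brown dog standing on a wet wooden dock at sunset."
long = short + " " + "Additional fine-grained detail about the scene. " * 40

print(metric(image, short))  # scored in full
print(metric(image, long))   # torchmetrics warns and truncates to 77 tokens
```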

Challenges and Considerations

Despite its strengths, DOCCI, like any dataset, has limitations that warrant consideration:

  • Diversity in Data: Because the photos were taken by a single researcher, they skew toward particular geographic locations, which may limit visual diversity. Future expansions could benefit from a broader array of images from varied sources.
  • Potential Biases: As the images were curated with specific challenges in mind, there’s a risk of built-in biases towards certain types of visual or textual features. Researchers should be cautious and aim to complement DOCCI with other datasets.

Conclusion

DOCCI is a unique asset for AI research in image and text processing, providing high-quality, detailed annotations designed specifically to address the shortcomings of existing models. As such, it serves not only as a tool for improving current technologies but also as a stepping stone for future innovations in visual and linguistic AI applications.