
ImageInWords: Unlocking Hyper-Detailed Image Descriptions (2405.02793v2)

Published 5 May 2024 in cs.CV and cs.CL

Abstract: Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest fidelity images, and boost compositional reasoning by up to 6% on ARO, SVO-Probes, and Winoground datasets. We release the IIW Eval benchmark with human judgement labels, object and image-level annotations from our framework, and existing image caption datasets enriched via IIW-model.

Citations (12)

Summary

  • The paper presents ImageInWords (IIW), a human-in-the-loop framework that enhances image descriptions through iterative refinement to achieve greater detail and accuracy.
  • It overcomes the limitations of web-scraped alt-text by producing object-level and image-level annotations, with descriptions averaging 19.1 verbs, 52.5 nouns, and 28 adjectives.
  • The framework boosts vision-language model performance in text-to-image tasks and fine-tuning, with evaluations showing a +66% preference over existing datasets.

Enhancing Image Descriptions with ImageInWords: A Detailed Look

In the field of AI-powered image understanding, the introduction of ImageInWords (IIW), a human-in-the-loop annotation framework, marks a significant advance in generating highly detailed, accurate image descriptions. This new dataset and methodology aim to provide richer annotations that are free of the inaccuracies and fabrications commonly known as hallucinations.

The Problem with Existing Image Descriptions

Current image datasets suffer from several limitations, primarily their reliance on web-scraped alt-text, which is often vague, low in detail, or irrelevant to the image's content. Models trained on such data generate descriptions plagued with inaccuracies and hallucinations. Furthermore, prior human-annotated datasets such as DCI and DOCCI, while more detailed than datasets built without human input, still lack the precision and comprehensiveness needed to guide accurate image description and downstream applications in vision-language models.

The ImageInWords Framework

The ImageInWords annotation methodology begins with an object detection phase, in which objects within an image are identified. These objects seed initial object-level descriptions generated by a vision-language model, which human annotators then refine for detail and accuracy; a minimal sketch of this loop follows the list below.

  • Object-Level Detailing: Initial descriptions are created for individual objects, with human annotators refining and verifying these details.
  • Image-Level Description: After objects are detailed, a comprehensive image-level description is generated, which is iteratively refined through human input. This ensures a contextually rich final description, capturing not only objects but their interactions and the overall scene.
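
To make the two-stage workflow concrete, here is a minimal Python sketch of the seeded, human-in-the-loop loop. Every function and type here (`detect_objects`, `vlm_describe`, `human_refine`, `Annotation`) is a hypothetical stub standing in for the detector, the vision-language model, and the annotator pass; the paper does not prescribe this API.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One object's description as it moves through the pipeline."""
    label: str
    seed: str          # VLM-generated seed description
    refined: str = ""  # human-refined description

# Hypothetical stubs: stand-ins for the detector, the VLM, and annotators.
def detect_objects(image_path: str) -> list[str]:
    return ["dog", "frisbee", "lawn"]  # an object detector would run here

def vlm_describe(image_path: str, target: str) -> str:
    return f"a {target} in the scene"  # a VLM would draft a description here

def human_refine(draft: str, context: str) -> str:
    return draft  # a human annotator would add detail and fix errors here

def annotate_image(image_path: str, rounds: int = 2) -> str:
    # Stage 1 (object-level): seed each detected object with a VLM
    # description, then have annotators refine it.
    objects = [Annotation(label=o, seed=vlm_describe(image_path, o))
               for o in detect_objects(image_path)]
    for obj in objects:
        obj.refined = human_refine(obj.seed, context=obj.label)

    # Stage 2 (image-level): compose one description from the refined
    # object details, then iterate human refinement on the whole text.
    description = vlm_describe(
        image_path,
        "the full scene, given: " + "; ".join(o.refined for o in objects))
    for _ in range(rounds):
        description = human_refine(description, context="image-level pass")
    return description

print(annotate_image("example.jpg"))
```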

This iterative methodology yields descriptions that are significantly more detailed and accurate than those in existing datasets. For instance, IIW descriptions contain an average of 19.1 verbs, 52.5 nouns, and 28 adjectives, indicating richer linguistic variety and depth.
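
As a rough illustration of how such part-of-speech counts might be reproduced, the sketch below tags a description with spaCy and tallies verbs, nouns, and adjectives. The tagger, tag set, and example sentence are assumptions for illustration, not the authors' exact tooling.

```python
from collections import Counter
import spacy  # pip install spacy; python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def pos_profile(description: str) -> Counter:
    """Tally coarse part-of-speech tags (VERB/NOUN/ADJ) in a description."""
    return Counter(tok.pos_ for tok in nlp(description)
                   if tok.pos_ in {"VERB", "NOUN", "ADJ"})

text = ("A golden retriever leaps across a freshly mowed lawn, "
        "its jaws stretched toward a bright red frisbee.")
print(pos_profile(text))  # e.g. Counter({'NOUN': 4, 'ADJ': 4, 'VERB': 3})
```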

Evaluation and Results

ImageInWords descriptions outperform existing datasets and models across several dimensions:

  • Readability and Coherence: IIW descriptions are evaluated as more readable and coherent.
  • Comprehensive Comparisons: Side-by-side evaluations with datasets like DCI and DOCCI show that IIW descriptions are preferred by +66% on average, indicating higher quality and detail (a sketch of this aggregation follows the list below).
  • Accuracy in Text-to-Image Generation: In tasks where descriptions are used to generate images, IIW leads to more accurate visual reconstructions, showcasing the practical utility of the dataset.
  • Model Fine-Tuning: When used to fine-tune other vision-language models, IIW data improves the generation of detailed descriptions and enhances compositional reasoning.
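
For intuition, a figure such as +66% can be read as a net preference rate over pairwise human judgments. The sketch below computes that aggregate; the three-way judgment schema ("iiw", "other", "tie") is an assumed simplification of a side-by-side evaluation form, not the paper's exact protocol.

```python
def net_preference(judgments: list[str]) -> float:
    """Net preference for IIW in percent: (wins - losses) / total.

    Each judgment is "iiw", "other", or "tie" -- an assumed
    simplification of a side-by-side human evaluation form.
    """
    wins, losses = judgments.count("iiw"), judgments.count("other")
    return 100.0 * (wins - losses) / len(judgments)

# Toy example: 10 raters comparing an IIW description with a baseline.
ratings = ["iiw"] * 7 + ["other"] * 1 + ["tie"] * 2
print(f"net preference: {net_preference(ratings):+.0f}%")  # prints +60%
```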

Practical Applications and Future Work

The IIW dataset not only improves the quality of automated image descriptions but also has broader implications for AI applications, such as aiding visually impaired individuals, enhancing text-to-image generation, and improving AI interpretability in visual contexts.

Looking forward, the project aims to expand its dataset to include more diverse and multilingual descriptions, enhancing its applicability and inclusivity across different regions and cultures.

Conclusion

ImageInWords represents a significant step forward in the annotation of visual data for AI applications. By meticulously combining human expertise with machine-generated seeds, it ensures descriptions that are not only detailed and comprehensive but also accurate and useful for numerous practical applications. As the dataset grows and evolves, it promises to play a pivotal role in enhancing machine understanding of complex visual scenes.
