
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions (2312.08578v2)

Published 14 Dec 2023 in cs.CV

Abstract: Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
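
The 77-token limit referenced in the title and abstract comes from CLIP's text-encoder context length, which is why the summarized sDCI captions exist. The snippet below is a minimal sketch, assuming the Hugging Face `transformers` CLIP tokenizer is available; the dense caption string is synthetic, standing in for a ~1,000-word DCI annotation.

```python
# Sketch: how little of a dense caption survives CLIP's 77-token text limit.
# Assumes the Hugging Face `transformers` package and the public
# "openai/clip-vit-base-patch32" checkpoint; the caption below is synthetic.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for a ~1,000-word mask-aligned DCI description.
dense_caption = " ".join(["a very detailed description of one image region"] * 150)

# Full caption length, far beyond the encoder's context window.
full_ids = tokenizer(dense_caption)["input_ids"]
print(f"full caption: {len(full_ids)} tokens (model limit: {tokenizer.model_max_length})")

# What a CLIP-style model actually sees: everything past 77 tokens is dropped.
truncated_ids = tokenizer(dense_caption, truncation=True, max_length=77)["input_ids"]
print(f"after truncation: {len(truncated_ids)} tokens")
```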


Summary

  • The paper demonstrates that advanced vision-language models face challenges in simultaneously matching dense captions to image subregions and distinguishing hard negatives.
  • The study introduces a benchmark dataset of 7,805 images with mask-aligned descriptions averaging over 1,000 words per image, providing a robust platform for fine-grained image understanding.
  • The paper finds that fine-tuning on high-quality, aligned image-caption pairs significantly boosts object and attribute recognition performance with less training data.

Overview of the Densely Captioned Images Dataset

The Densely Captioned Images (DCI) dataset represents a new approach to vision-language datasets, focused on rich, detailed image annotations. Comprising 7,805 natural images, the dataset includes human-written text that is tightly linked to specific sub-regions within each image. These annotations average over 1,000 words per image, providing a robust basis for detailed image understanding.

Limitations and Insights from Existing Models

Existing vision-language models (VLMs) are typically trained on large-scale datasets of loosely associated image-text pairs, and they struggle with tasks that require a deeper understanding of how image details relate to their descriptions. Fine-grained visual understanding is becoming increasingly important for such models. To address this, DCI provides a benchmark for evaluating VLMs on fine-detail understanding through two main evaluations: matching dense captions to image sub-regions, and distinguishing correct captions from closely related hard negatives.
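
The subcrop-caption matching evaluation can be pictured as follows. This is a minimal sketch under assumptions, not the paper's released evaluation code: it uses a public Hugging Face CLIP checkpoint, and `subcrops` and `captions` are hypothetical inputs the caller prepares (cropped regions of one DCI image and their summarized captions, in matching order).

```python
# Sketch of subcrop-caption matching: score every caption against every subcrop
# and check whether each subcrop's highest-scoring caption is its own.
# Assumes Hugging Face `transformers` and PIL images supplied by the caller.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subcrop_caption_accuracy(subcrops, captions):
    """Fraction of subcrops whose best-matching caption is the correct (diagonal) one."""
    inputs = processor(text=captions, images=subcrops,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: [n_subcrops, n_captions] similarity scores
    preds = outputs.logits_per_image.argmax(dim=-1)
    targets = torch.arange(len(subcrops))
    return (preds == targets).float().mean().item()
```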

Evaluation Results on the DCI Benchmark

The evaluation results show that even the most advanced VLMs struggle to excel at subcrop-caption matching and hard-negative discrimination simultaneously. Models trained specifically to identify hard negatives do improve in that area, but this often comes at the cost of reduced performance on matching image sub-regions to their dense captions.
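
For intuition, the hard-negative test discussed above can be sketched as a pairwise comparison: a model should score a subcrop's true caption above a minimally altered, incorrect variant. The snippet below is illustrative only, not the paper's evaluation code; `subcrop`, `true_caption`, and `hard_negative` are assumed inputs provided by the caller.

```python
# Sketch of hard-negative discrimination for a single subcrop.
# Assumes Hugging Face `transformers`; `subcrop` is a PIL image.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prefers_true_caption(subcrop, true_caption, hard_negative):
    """Return True if the model scores the true caption above the hard negative."""
    inputs = processor(text=[true_caption, hard_negative], images=subcrop,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # [2] scores for one image
    return bool(scores[0] > scores[1])
```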

Implications for Model Development and Training

Fine-tuning CLIP on high-quality, well-aligned image-caption pairs from the summarized DCI dataset (sDCI) yields significant gains, particularly on tests of object and attribute recognition. These gains are achieved with a considerably smaller training set than is typically used for standard VLM training, showcasing the value of highly aligned training data over larger volumes of loosely correlated image-text pairs. The dataset's release aims to spur the development of more sophisticated, context-aware VLMs by providing a rich resource for benchmarking and fine-tuning.
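
As a rough picture of what such fine-tuning involves, the sketch below shows one standard CLIP-style contrastive training step on batches of (subcrop, summarized caption) pairs. It illustrates the general recipe only, not the paper's exact setup; `batch` is assumed to come from a hypothetical DataLoader that has already run a `CLIPProcessor` over the pairs.

```python
# Sketch of a CLIP-style contrastive fine-tuning step on (subcrop, caption) pairs.
# Assumes Hugging Face `transformers`; batches contain pixel_values, input_ids,
# and attention_mask produced elsewhere by a CLIPProcessor (not shown here).
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)

def contrastive_step(batch):
    """One step: image i and caption i are positives; all other in-batch pairings are negatives."""
    outputs = model(**batch)                      # logits_per_image: [B, B]
    targets = torch.arange(outputs.logits_per_image.size(0))
    loss = 0.5 * (
        torch.nn.functional.cross_entropy(outputs.logits_per_image, targets)
        + torch.nn.functional.cross_entropy(outputs.logits_per_text, targets)
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```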
