
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions (2312.08578v2)

Published 14 Dec 2023 in cs.CV

Abstract: Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
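
The 77-token limit referenced in the title and abstract comes from CLIP's text-encoder context length, which is why the summarized sDCI captions exist. The snippet below is a minimal sketch, assuming the Hugging Face `transformers` CLIP tokenizer is available; the dense caption string is synthetic, standing in for a ~1,000-word DCI annotation.

```python
# Sketch: how little of a dense caption survives CLIP's 77-token text limit.
# Assumes the Hugging Face `transformers` package and the public
# "openai/clip-vit-base-patch32" checkpoint; the caption below is synthetic.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for a ~1,000-word mask-aligned DCI description.
dense_caption = " ".join(["a very detailed description of one image region"] * 150)

# Full caption length, far beyond the encoder's context window.
full_ids = tokenizer(dense_caption)["input_ids"]
print(f"full caption: {len(full_ids)} tokens (model limit: {tokenizer.model_max_length})")

# What a CLIP-style model actually sees: everything past 77 tokens is dropped.
truncated_ids = tokenizer(dense_caption, truncation=True, max_length=77)["input_ids"]
print(f"after truncation: {len(truncated_ids)} tokens")
```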


Summary

  • The paper demonstrates that advanced vision-language models face challenges in simultaneously matching dense captions to image subregions and distinguishing hard negatives.
  • The study introduces a benchmark dataset of 7,805 images with mask-aligned descriptions averaging over 1,000 words per image, providing a robust platform for fine-grained image understanding.
  • The paper finds that fine-tuning on high-quality, aligned image-caption pairs significantly boosts object and attribute recognition performance with less training data.

Overview of the Densely Captioned Images Dataset

The Densely Captioned Images (DCI) dataset represents a new approach to vision-language datasets, focused on rich, detailed image annotations. Comprising 7,805 natural images, the dataset includes human-written text that is tightly linked to specific sub-regions within each image. These annotations average over 1,000 words per image, providing a robust basis for detailed image understanding.

Limitations and Insights from Existing Models

Existing vision-language models (VLMs) are typically trained on large-scale datasets of loosely associated image-text pairs, and they struggle with tasks that require a deeper understanding of how image details relate to their descriptions. Fine-grained visual understanding is becoming increasingly important for such models. To address this, DCI provides a benchmark for evaluating VLMs on fine-detail understanding through two main evaluations: matching dense captions to image sub-regions, and distinguishing correct captions from closely related hard negatives.
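
The subcrop-caption matching evaluation can be pictured as follows. This is a minimal sketch under assumptions, not the paper's released evaluation code: it uses a public Hugging Face CLIP checkpoint, and `subcrops` and `captions` are hypothetical inputs the caller prepares (cropped regions of one DCI image and their summarized captions, in matching order).

```python
# Sketch of subcrop-caption matching: score every caption against every subcrop
# and check whether each subcrop's highest-scoring caption is its own.
# Assumes Hugging Face `transformers` and PIL images supplied by the caller.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subcrop_caption_accuracy(subcrops, captions):
    """Fraction of subcrops whose best-matching caption is the correct (diagonal) one."""
    inputs = processor(text=captions, images=subcrops,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: [n_subcrops, n_captions] similarity scores
    preds = outputs.logits_per_image.argmax(dim=-1)
    targets = torch.arange(len(subcrops))
    return (preds == targets).float().mean().item()
```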

Evaluation Results on the DCI Benchmark

The evaluation results show that even the most advanced VLMs struggle to excel at subcrop-caption matching and hard-negative discrimination simultaneously. Models trained specifically to identify hard negatives do improve in that area, but this often comes at the cost of reduced performance on matching image sub-regions to their dense captions.
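
For intuition, the hard-negative test discussed above can be sketched as a pairwise comparison: a model should score a subcrop's true caption above a minimally altered, incorrect variant. The snippet below is illustrative only, not the paper's evaluation code; `subcrop`, `true_caption`, and `hard_negative` are assumed inputs provided by the caller.

```python
# Sketch of hard-negative discrimination for a single subcrop.
# Assumes Hugging Face `transformers`; `subcrop` is a PIL image.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prefers_true_caption(subcrop, true_caption, hard_negative):
    """Return True if the model scores the true caption above the hard negative."""
    inputs = processor(text=[true_caption, hard_negative], images=subcrop,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # [2] scores for one image
    return bool(scores[0] > scores[1])
```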

Implications for Model Development and Training

Fine-tuning CLIP on high-quality, well-aligned image-caption pairs from the summarized DCI dataset (sDCI) yields significant gains, particularly on tests of object and attribute recognition. These gains are achieved with a considerably smaller training set than is typically used for standard VLM training, showcasing the value of highly aligned training data over larger volumes of loosely correlated image-text pairs. The dataset's release aims to spur the development of more sophisticated, context-aware VLMs by providing a rich resource for benchmarking and fine-tuning.
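
As a rough picture of what such fine-tuning involves, the sketch below shows one standard CLIP-style contrastive training step on batches of (subcrop, summarized caption) pairs. It illustrates the general recipe only, not the paper's exact setup; `batch` is assumed to come from a hypothetical DataLoader that has already run a `CLIPProcessor` over the pairs.

```python
# Sketch of a CLIP-style contrastive fine-tuning step on (subcrop, caption) pairs.
# Assumes Hugging Face `transformers`; batches contain pixel_values, input_ids,
# and attention_mask produced elsewhere by a CLIPProcessor (not shown here).
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)

def contrastive_step(batch):
    """One step: image i and caption i are positives; all other in-batch pairings are negatives."""
    outputs = model(**batch)                      # logits_per_image: [B, B]
    targets = torch.arange(outputs.logits_per_image.size(0))
    loss = 0.5 * (
        torch.nn.functional.cross_entropy(outputs.logits_per_image, targets)
        + torch.nn.functional.cross_entropy(outputs.logits_per_text, targets)
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```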
