A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions (2312.08578v2)
Abstract: Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest-quality available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging over 1000 words each. With precise and reliable captions tied to specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption to its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not translate into significant improvement on our sDCI-based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite the small training set. By releasing the first human-annotated dense image-captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
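The caption-to-subcrop matching task described in the abstract can be sketched concretely. Below is a minimal illustration (not the authors' released evaluation code) of scoring every (subcrop, caption) pair with an off-the-shelf CLIP and checking whether each caption is matched to its own subcrop; the model checkpoint and the helper signature are assumptions for illustration, and the `truncation=True` flag reflects CLIP's 77-token text limit that motivates sDCI.

```python
# Minimal sketch of a caption-to-subcrop matching evaluation with CLIP.
# Assumes subcrops and their aligned captions have already been loaded
# (e.g., from DCI/sDCI); the checkpoint name is an illustrative choice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def matching_accuracy(subcrops: list[Image.Image], captions: list[str]) -> float:
    """Fraction of captions whose highest-scoring subcrop is the correct one,
    where captions[i] is the ground-truth description of subcrops[i]."""
    # CLIP's text tower truncates input at 77 tokens, which is why the
    # summarized sDCI captions are needed for this style of evaluation.
    inputs = processor(text=captions, images=subcrops,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # out.logits_per_text has shape [num_captions, num_subcrops]:
    # similarity of every caption against every subcrop.
    preds = out.logits_per_text.argmax(dim=-1)
    gold = torch.arange(len(captions))
    return (preds == gold).float().mean().item()
```

Under this sketch, a model that truly grounds dense captions should recover the diagonal pairing; chance accuracy falls as the number of subcrops per image grows.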