ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation (2403.01306v3)
Abstract: Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but it is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but they permit semantically related yet highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples, which provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, Image Caption Concreteness (ICC), that evaluates caption text without an image reference to measure its concreteness and relevance for multimodal learning. Our approach leverages strong foundation models to measure visual-semantic information loss in multimodal representations. We demonstrate that ICC correlates strongly with human evaluations of concreteness for both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: it selects the highest-quality samples from multimodal web-scale datasets, enabling efficient training in resource-constrained settings.
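To make the "visual-semantic information loss" idea concrete, below is a minimal, illustrative sketch of one way such a reference-free concreteness score could be computed: round-trip the caption through a visual-semantic bottleneck (text → image → text) and measure how much of its meaning survives. This is an assumption-laden example, not the paper's exact pipeline; the specific model choices (Stable Diffusion v1.5, BLIP, MiniLM) and the `concreteness_score` helper are introduced here purely for illustration.

```python
# Illustrative sketch (NOT the paper's exact method): proxy a caption's
# concreteness by how well its meaning survives a text -> image -> text
# round trip through off-the-shelf foundation models.
import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text -> image: a concrete, depictable caption should produce an image
# that preserves its meaning; abstract or subjective text typically will not.
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# Image -> text: re-caption the generated image.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

# Sentence encoder used to compare original and reconstructed captions.
sim_model = SentenceTransformer("all-MiniLM-L6-v2", device=device)


def concreteness_score(caption: str) -> float:
    """Rough concreteness proxy in [-1, 1]: similarity between a caption
    and its text->image->text reconstruction."""
    image = t2i(caption, num_inference_steps=25).images[0]

    inputs = blip_processor(images=image, return_tensors="pt").to(device)
    out_ids = blip.generate(**inputs, max_new_tokens=30)
    reconstruction = blip_processor.decode(out_ids[0], skip_special_tokens=True)

    emb = sim_model.encode([caption, reconstruction], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


if __name__ == "__main__":
    # A concrete caption should score high, an abstract/subjective one low.
    print(concreteness_score("a brown dog catching a frisbee on a beach"))
    print(concreteness_score("wishing everyone a wonderful and happy monday"))
```

In a curation setting, such a score would be computed per caption and used to rank or filter a web-scale dataset, keeping only the most concrete samples for training.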
Authors: Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes