ContextRef: Evaluating Referenceless Metrics For Image Description Generation (2309.11710v1)
Abstract: Referenceless metrics (e.g., CLIPScore) use pretrained vision–language models to assess image descriptions directly, without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful on ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef nonetheless remains a challenging benchmark, in large part because of context dependence.
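To make the setup concrete, below is a minimal sketch of a CLIPScore-style referenceless metric, assuming the Hugging Face `transformers` CLIP API. The 2.5 rescaling weight follows the CLIPScore paper; the checkpoint choice and pre/post-processing here are illustrative, not the exact pipeline evaluated in this work. The commented usage at the end is a hypothetical robustness probe in the spirit of the benchmark's checks, not one of ContextRef's actual test items.

```python
# Sketch of a referenceless image-description metric in the CLIPScore style:
# embed the image and the candidate description with a pretrained CLIP model
# and score their rescaled cosine similarity.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative checkpoint choice
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clipscore(image: Image.Image, description: str, w: float = 2.5) -> float:
    """Referenceless score: w * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    cos = F.cosine_similarity(out.image_embeds, out.text_embeds).item()
    return w * max(cos, 0.0)

# Hypothetical robustness probe (illustrative only): a sound metric should
# not prefer a word-shuffled description over the intact one.
#
#   img = Image.open("photo.jpg")
#   desc = "A brown dog catching a frisbee in a sunny park."
#   shuffled = "frisbee sunny a in catching brown park a dog"
#   assert clipscore(img, desc) > clipscore(img, shuffled)
```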
- Word order does matter and shuffled language models know it. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6907–6919, 2022.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- SPICE: Semantic propositional image caption evaluation. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part V, pp. 382–398. Springer, 2016.
- OpenFlamingo. Zenodo, March 2023. URL https://doi.org/10.5281/zenodo.7733589.
- METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909.
- Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pp. 113–120, 2004.
- Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research, 55:409–442, 2016.
- Unsupervised parsing via constituency tests. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4798–4808, 2020.
- Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5804–5812, 2018.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- Data quality in online human-subjects research: Comparisons between MTurk, Prolific, CloudResearch, Qualtrics, and SONA. PLOS ONE, 18(3):e0279720, 2023.
- dzryk. Multimodal few-shot learning by convex combination of token embeddings, 2023. URL https://colab.research.google.com/drive/1fokumWeasHTo0KXpfeZ6Z0OgLOQ2SUso?usp=sharing.
- Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1292–1302, 2013.
- EVA-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
- “It’s almost like they’re trying to hide it”: How user-provided image descriptions have failed to make Twitter accessible. In The World Wide Web Conference, pp. 549–559, 2019.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528, 2021.
- The curious case of neural text degeneration. In International Conference on Learning Representations, 2019.
- OpenCLIP. Zenodo, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
- Transparent human evaluation for image captioning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3464–3478, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.254. URL https://aclanthology.org/2022.naacl-main.254.
- PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning. arXiv preprint arXiv:2303.08389, 2023.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Context matters for image descriptions for accessibility: Challenges for referenceless evaluation metrics. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4685–4697, 2022a.
- Concadia: Towards image-based text generation with a purpose. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4667–4684, 2022b.
- QACE: Asking questions to evaluate an image caption. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4631–4638, 2021a.
- UMIC: An unreferenced metric for image captioning via contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 220–226, 2021b.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- VisualGPTScore: Visio-linguistic reasoning with multimodal generative pre-training scores. arXiv preprint arXiv:2306.01879, 2023.
- Improved image captioning via policy gradient optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision, pp. 873–881, 2017.
- Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 747–756, 2012.
- What’s in an alt tag? Exploring caption content priorities through collaborative captioning. ACM Transactions on Accessible Computing (TACCESS), 15(1):1–32, 2022.
- Pragmatic issue-sensitive image captioning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1924–1938, 2020.
- Zero-shot learning by convex combination of semantic embeddings. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
- OpenAI. GPT-4 technical report, 2023. URL https://arxiv.org/abs/2303.08774.
- Prolific.ac—a subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17:22–27, 2018.
- BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
- Sometimes we want ungrammatical translations. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3205–3227, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.275. URL https://aclanthology.org/2021.findings-emnlp.275.
- Sandro Pezzelle. Dealing with semantic underspecification in multimodal NLP. arXiv preprint arXiv:2306.05240, 2023.
- Are multimodal models robust to image and text perturbations? arXiv preprint arXiv:2212.08044, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
- Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118, 2020.
- Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045, 2018.
- LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Improved image caption rating–datasets, game, and model. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2023.
- Masked language modeling and the distributional hypothesis: Order word matters pre-training for little. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2888–2913, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.230. URL https://aclanthology.org/2021.emnlp-main.230.
- WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2443–2449, 2021.
- " person, shoes, tree. is the person naked?" what people with vision impairments want in image descriptions. In Proceedings of the 2020 chi conference on human factors in computing systems, pp. 1–13, 2020.
- Going beyond one-size-fits-all image descriptions to satisfy the information wants of people who are blind or have low vision. In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1–15, 2021.
- Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
- VisText: A benchmark for semantically rich chart captioning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023.
- Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
- Emiel van Miltenburg and Desmond Elliott. Room for improvement in automatic image description: an error analysis. CoRR, abs/1704.04198, 2017. URL http://arxiv.org/abs/1704.04198.
- CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.
- Toward supporting quality alt text in computing publications. In Proceedings of the 19th International Web for All Conference, pp. 1–12, 2022.
- Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095, 2022.
- Automated testing of image captioning systems. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 467–479, 2022a.
- CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022b.