Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models (2310.05338v2)
Abstract: Object hallucination poses a significant challenge in vision-language (VL) models, often leading to the generation of nonsensical or unfaithful responses that mention non-existent objects. However, the absence of a general measurement for evaluating object hallucination in VL models has hindered our understanding of this issue and our ability to mitigate it. In this work, we present NOPE (Negative Object Presence Evaluation), a novel benchmark designed to assess object hallucination in VL models through visual question answering (VQA). We propose a cost-effective and scalable approach that uses LLMs to generate 29.5k high-quality synthetic negative-pronoun (NegP) examples for NOPE. We extensively investigate the ability of 10 state-of-the-art VL models to discern the non-existence of objects in visual questions, where the ground-truth answer is a NegP (e.g., "none"). Additionally, we evaluate their standard performance on visual questions from 9 other VQA datasets. Through our experiments, we demonstrate that no VL model is immune to object hallucination, as all models achieve accuracy below 10% on NegP questions. Furthermore, we find that lexically diverse visual questions, question types with large scopes, and scene-relevant objects heighten the risk of object hallucination in VL models.
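The evaluation the abstract describes (counting a model's answer as correct only when it expresses object non-existence via a negative pronoun such as "none") can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the `NEGP` answer set and the normalization rules are assumptions for demonstration.

```python
# Illustrative sketch of NegP accuracy scoring: a prediction counts as
# correct only if it answers with a negative pronoun (e.g., "none").
# The NEGP set and normalization below are assumptions, not the paper's
# exact matching rules.

NEGP = {"none", "nothing", "nobody", "no one", "nowhere", "neither"}

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return answer.lower().strip().rstrip(".!?")

def negp_accuracy(predictions):
    """Fraction of predictions that correctly answer with a negative pronoun."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if normalize(p) in NEGP)
    return hits / len(predictions)

preds = ["None.", "a red umbrella", "nothing", "two dogs"]
print(negp_accuracy(preds))  # 0.5
```

Under a scorer like this, a model that hallucinates a plausible object ("a red umbrella") for a question about a non-existent one is marked wrong, which is how sub-10% NegP accuracy can coexist with strong performance on standard VQA benchmarks.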