PhD: A ChatGPT-Prompted Visual Hallucination Evaluation Dataset (2403.11116v3)

Published 17 Mar 2024 in cs.CV and cs.AI

Abstract: Multimodal LLMs (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.
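The abstract describes the dataset's organization concretely enough to sketch: each entry is a VQA triplet indexed along two axes, a task (object, attribute, sentiment, or position recognition, or counting) and a mode (PhD-base, PhD-iac, PhD-icc, PhD-ccs). Below is a minimal Python sketch of one such entry and a scoring loop; the class names, fields, and yes/no exact-match scoring are illustrative assumptions, not the dataset's published schema or the paper's evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    """The five recognition tasks named in the abstract."""
    OBJECT = "object"
    ATTRIBUTE = "attribute"
    SENTIMENT = "sentiment"
    POSITION = "position"
    COUNTING = "counting"

class Mode(Enum):
    """The four question modes named in the abstract."""
    BASE = "phd-base"  # plain visual QA
    IAC = "phd-iac"    # question asked with inaccurate context
    ICC = "phd-icc"    # question asked with incorrect context
    CCS = "phd-ccs"    # AI-generated counter-common-sense image

@dataclass
class VQATriplet:
    """Hypothetical schema for one PhD entry: an image, a
    hitem-embedded question, and the expected answer."""
    image_path: str
    question: str
    answer: str  # "yes" or "no"
    task: Task
    mode: Mode

def accuracy(model, items: list[VQATriplet]) -> float:
    """Score an MLLM by exact yes/no agreement. `model` is assumed to be
    any callable mapping (image_path, question) to an answer string."""
    correct = sum(
        model(it.image_path, it.question).strip().lower() == it.answer
        for it in items
    )
    return correct / len(items)
```

Grouping the resulting accuracies per task and per mode is what would surface the cross-mode and cross-task variability the abstract reports.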

Authors (8)
  1. Jiazhen Liu
  2. Yuhan Fu
  3. Ruobing Xie
  4. Runquan Xie
  5. Xingwu Sun
  6. Fengzong Lian
  7. Zhanhui Kang
  8. Xirong Li