
Visual Hallucinations of Multi-modal Large Language Models (2402.14683v2)

Published 22 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Visual hallucination (VH) means that a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, which results in biased understanding of MLLMs' performance under VH due to limited diversity of such VH instances. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on the text descriptions. We collect a benchmark dataset with 1,200 VH instances in 8 VH modes using VHTest. We find that existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM using our benchmark dataset reduces its likelihood to hallucinate without sacrificing its performance on other benchmarks. Our benchmarks are publicly available: https://github.com/wenhuang2000/VHTest.
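The abstract describes a three-stage pipeline: find initial VH instances in existing datasets, write a text description for each VH mode, and feed those descriptions to a text-to-image model. A minimal sketch of how such an instance might be assembled is below; all function names, field names, and mode names are illustrative assumptions, not the paper's actual API or taxonomy, and the real stage 3 would call a text-to-image model such as DALL-E 3.

```python
# Hedged sketch of the three-stage VHTest pipeline described in the abstract.
# Every identifier here is a hypothetical placeholder, not the paper's code.

from dataclasses import dataclass

# The paper reports 8 VH modes; these names are assumed for illustration.
VH_MODES = ["existence", "shape", "color", "orientation",
            "OCR", "size", "position", "counting"]

@dataclass
class VHInstance:
    mode: str              # which VH mode this instance probes
    description: str       # text prompt for the text-to-image model
    question: str          # VQA question later posed to the MLLM
    reference_answer: str  # ground-truth answer for scoring hallucination

def build_description(mode: str, seed_detail: str) -> str:
    """Stage 2: turn an initial VH finding into a text-to-image prompt."""
    return f"A photo designed to test {mode} hallucination: {seed_detail}"

def make_instance(mode: str, seed_detail: str,
                  question: str, answer: str) -> VHInstance:
    # Stage 3 would send the description to a generative model (e.g.
    # DALL-E 3) to produce the VH image; here we only assemble metadata.
    if mode not in VH_MODES:
        raise ValueError(f"unknown VH mode: {mode}")
    return VHInstance(mode, build_description(mode, seed_detail),
                      question, answer)

inst = make_instance("counting", "exactly three red apples on a table",
                     "How many apples are in the image?", "3")
print(inst.mode, "|", inst.description)
```

Benchmarking would then compare each MLLM's answer against `reference_answer` per mode, which is how per-mode hallucination rates like those in the paper could be tabulated.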

References (30)
  1. VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. In CVPR, 2022.
  2. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv, 2023.
  3. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv, 2023.
  4. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv, 2023.
  5. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv, 2023.
  6. Language is not all you need: Aligning perception with language models. arXiv, 2023.
  7. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. In EMNLP, 2023.
  8. Survey of hallucination in natural language generation. ACM Computing Surveys, 2023.
  9. Visual evidence prompting mitigates hallucinations in multimodal large language models. In ICLR, 2024.
  10. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
  11. Microsoft COCO: Common objects in context. In ECCV, 2014.
  12. Mitigating hallucination in large multi-modal models via robust instruction tuning. In ICLR, 2024.
  13. A survey on hallucination in large vision-language models. arXiv, 2024.
  14. Improved baselines with visual instruction tuning. arXiv, 2023.
  15. Midjourney. https://www.midjourney.com. Accessed: 2024-02-10.
  16. DINOv2: Learning robust visual features without supervision. TMLR, 2023.
  17. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
  18. Learning transferable visual models from natural language supervision. In ICML, 2021.
  19. A survey of hallucination in large foundation models. arXiv, 2023.
  20. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  21. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In EMNLP, 2023.
  22. The White House. Fact sheet: President Biden issues executive order on safe, secure, and trustworthy artificial intelligence, 2023. Accessed: 2023-11-18.
  23. Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. In EMNLP, 2022.
  24. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2023.
  25. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. arXiv, 2024.
  26. Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites. In ICMM, 2024.
  27. The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv, 2023.
  28. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv, 2023.
  29. Woodpecker: Hallucination correction for multimodal large language models. arXiv, 2023.
  30. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv, 2023.
Citations (19)