A Survey on Hallucination in Large Vision-Language Models (2402.00253v2)

Published 1 Feb 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Recent development of Large Vision-Language Models (LVLMs) has attracted growing attention within the AI landscape for its practical implementation potential. However, "hallucination", or more specifically, the misalignment between factual visual content and corresponding textual generation, poses a significant challenge in utilizing LVLMs. In this comprehensive survey, we dissect LVLM-related hallucinations in an attempt to establish an overview and facilitate future mitigation. Our scrutiny starts with a clarification of the concept of hallucinations in LVLMs, presenting a variety of hallucination symptoms and highlighting the unique challenges inherent in LVLM hallucinations. Subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations unique to LVLMs. Additionally, we delve into an investigation of the root causes of these hallucinations, encompassing insights from the training data and model components. We also critically review existing methods for mitigating hallucinations. The open questions and future directions pertaining to hallucinations within LVLMs are discussed to conclude this survey.

Hallucinations in Large Vision-Language Models: Evaluation, Causes, and Mitigation

The paper "A Survey on Hallucination in Large Vision-LLMs" provides a comprehensive overview of the challenges associated with hallucinations in Large Vision-LLMs (LVLMs), particularly those that arise due to misalignments between visual input and textual output. This survey is particularly relevant for experienced researchers in AI, as LVLMs represent an intersection between computer vision and natural language processing, posing unique challenges.

LVLMs have emerged as a sophisticated evolution of earlier vision-language models, primarily leveraging the capabilities of LLMs such as GPT-4 and LLaMA and combining them with visual input processing to solve a range of multimodal tasks. While these models show promise across various applications, hallucinations, defined as discrepancies or inaccuracies between visual content and its textual descriptions, significantly hinder their effective deployment.
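To make this pipeline concrete, the following is a minimal, schematic sketch of the architecture the survey assumes throughout: a vision encoder, a connection module that projects visual features into the language model's embedding space, and an LLM that generates text conditioned on the concatenated visual and textual tokens. All module sizes, names, and the tiny transformer stand-in are illustrative assumptions, not the design of any specific model discussed in the survey.

```python
import torch
import torch.nn as nn

class ToyLVLM(nn.Module):
    """Schematic LVLM: vision encoder -> connection module -> language model.
    Dimensions are illustrative; real systems use CLIP-like encoders and LLaMA-scale LLMs."""

    def __init__(self, vis_dim=768, llm_dim=1024, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT over image patches).
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)
        # Connection module: projects visual features into the LLM embedding space.
        self.connector = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the LLM: an embedding table plus a tiny transformer.
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_features, input_ids):
        # patch_features: (B, num_patches, vis_dim); input_ids: (B, seq_len)
        visual_tokens = self.connector(self.vision_encoder(patch_features))
        text_tokens = self.token_embed(input_ids)
        # Prepend visual tokens to the text sequence, as most LVLMs do.
        hidden = self.llm(torch.cat([visual_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)  # next-token logits over the joint sequence

model = ToyLVLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```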

Evaluation Methods and Benchmarks

The paper presents a detailed examination of current methods and benchmarks for evaluating hallucinations in LVLMs. It categorizes evaluation approaches into those assessing hallucination discrimination (can the model tell hallucinated content from faithful content?) and those assessing non-hallucinatory generation (does the model produce faithful descriptions in the first place?). These approaches typically rely either on handcrafted pipelines or on model-based, end-to-end methods. The survey discusses prominent evaluation metrics and benchmarks, highlighting their focus on objects, attributes, and relations within visual content. Benchmarks such as POPE and CIEM provide structured means to assess LVLMs' ability to interpret visual information accurately without generating hallucinatory outputs; a sketch of this polling-style protocol appears below. Ongoing refinement and careful selection of evaluation methods remain crucial for a comprehensive assessment of LVLM performance.
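As an illustration of the polling-style discriminative evaluation that POPE and CIEM popularized, the sketch below scores yes/no answers to object-existence questions. The scoring function and the toy answers are assumptions for illustration; the actual benchmarks define their own question construction (POPE, for instance, uses random, popular, and adversarial negative sampling) while reporting similar accuracy, precision, recall, F1, and yes-ratio statistics.

```python
from typing import List, Tuple

def pope_style_scores(pairs: List[Tuple[str, str]]) -> dict:
    """Score a POPE-style object-existence poll.

    `pairs` holds (model_answer, ground_truth) strings, each "yes" or "no":
    "yes" means the queried object is (claimed to be) present. Hallucination
    shows up as answering "yes" for objects that are absent (false positives).
    """
    tp = sum(1 for a, g in pairs if a == "yes" and g == "yes")
    fp = sum(1 for a, g in pairs if a == "yes" and g == "no")
    tn = sum(1 for a, g in pairs if a == "no" and g == "no")
    fn = sum(1 for a, g in pairs if a == "no" and g == "yes")
    total = max(tp + fp + tn + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-9),
        "yes_ratio": (tp + fp) / total,  # a high yes-ratio often signals over-affirmation
    }

# Toy usage: three polled questions answered by some LVLM.
print(pope_style_scores([("yes", "yes"), ("yes", "no"), ("no", "no")]))
```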

Causes of Hallucinations

The paper explores the underlying causes of hallucinations, which can stem from several components of an LVLM. Key causes include biases and irrelevance in the training data, limitations of the vision encoder, and challenges in modality alignment and in the LLM itself. The survey identifies data bias as a significant contributor: skewed training data can lead LVLMs to generate inaccurate visual descriptions; the brief sketch after this paragraph illustrates how such bias can be surfaced. Vision encoders with limited input resolution or capacity may likewise fail to capture fine-grained details, exacerbating hallucinations. Misalignment between modalities, often attributed to overly simplistic connection modules, further contributes to these discrepancies.
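One way the data-bias cause can be made tangible is to measure object co-occurrence statistics in training captions: if an object almost always appears alongside another, the learned language prior may push the model to mention the partner object even when it is absent from the image. The helper below is a hypothetical illustration of such a check, not a method from the survey; the caption corpus and vocabulary are toy assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_bias(captions, vocab):
    """Estimate P(x mentioned | y mentioned) over a caption corpus for objects in `vocab`.
    Strongly skewed conditionals (e.g. 'fork' almost always with 'knife') are the kind of
    prior that can push an LVLM to hallucinate the partner object."""
    single = Counter()
    joint = Counter()
    for cap in captions:
        present = {w for w in vocab if w in cap.lower()}
        single.update(present)
        joint.update(combinations(sorted(present), 2))
    cond = {}
    for (a, b), n_ab in joint.items():
        cond[(b, a)] = n_ab / single[a]   # P(b | a)
        cond[(a, b)] = n_ab / single[b]   # P(a | b)
    return cond

captions = [
    "a fork and a knife on a table",
    "a knife on a cutting board",
    "a fork next to a plate",
]
print(cooccurrence_bias(captions, {"fork", "knife", "plate", "table"}))
```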

Mitigation Strategies

To counter hallucinations, researchers have explored strategies targeting each component of an LVLM. Improvements to training data aim to reduce bias and enrich annotations so that models learn from more accurate visual contexts. On the vision-encoder side, scaling up image resolution and adding perceptual enhancements bolster object-level perception. More capable connection modules and alignment-optimization techniques refine the interaction between modalities for more accurate outputs. In addition, hallucination-aware LLM decoding strategies and alignment of model responses with human preferences (for example via RLHF or DPO) offer further mitigation options, and post-processing mechanisms provide a final avenue for revising outputs and reducing discrepancies. A sketch of one decoding-side strategy follows.
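Among the decoding-side mitigations the survey reviews is contrastive decoding against a degraded visual input (in the spirit of visual contrastive decoding, reference 26): tokens whose likelihood persists when the image is blurred or removed are presumed to come from language priors and are penalized. The sketch below shows only the per-step logit adjustment, with NumPy arrays standing in for real model outputs; the alpha and plausibility-cutoff values are illustrative assumptions.

```python
import numpy as np

def contrastive_decode_step(logits_with_image, logits_degraded, alpha=1.0, cutoff=0.1):
    """One decoding step of a visual-contrastive-style logit adjustment.

    `logits_with_image` / `logits_degraded` are 1-D arrays over the vocabulary,
    produced by the same LVLM given the true image and a blurred/blank image.
    Tokens whose probability is driven by the language prior rather than the
    visual evidence are down-weighted by the contrastive term.
    """
    # Restrict to tokens that are plausible under the true image (adaptive plausibility cutoff).
    probs = np.exp(logits_with_image - logits_with_image.max())
    probs /= probs.sum()
    keep = probs >= cutoff * probs.max()

    adjusted = (1 + alpha) * logits_with_image - alpha * logits_degraded
    adjusted[~keep] = -np.inf          # never sample implausible tokens
    return int(np.argmax(adjusted))    # greedy choice over adjusted logits

# Toy usage with a 5-token vocabulary.
next_token = contrastive_decode_step(np.array([2.0, 1.5, 0.2, -1.0, 0.1]),
                                     np.array([0.5, 1.8, 0.1, -1.2, 0.0]))
print(next_token)
```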

Future Directions and Conclusion

The survey concludes by discussing prospective research directions, emphasizing the importance of advancing supervision objectives, enriching modalities, and enhancing LVLM interpretability. By addressing these areas, researchers can tackle hallucinations more effectively, thereby driving advancements in LVLM technology.

In summary, the document offers a solid foundation for understanding and addressing hallucinations within LVLMs, highlighting evaluation methodologies, identifying causes, and discussing practical mitigation techniques. This survey serves as a valuable resource for AI researchers focused on improving LVLM reliability and functionality, paving the way for future exploration in creating robust vision-language systems.

References (61)
  1. Flamingo: a visual language model for few-shot learning. In NeurIPS, volume 35, 2022.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  3. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023.
  4. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
  5. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  6. Can we edit multimodal large language models? In EMNLP, 2023.
  7. Fine-grained image captioning with clip reward. In Findings of NAACL, 2022.
  8. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.
  9. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  10. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In EACL, 2023.
  11. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP, 2021.
  12. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  13. Imagebind one embedding space to bind them all. In CVPR, 2023.
  14. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394, 2023.
  15. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905, 2023.
  16. The curious case of neural text degeneration. In ICLR, 2020.
  17. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
  18. Ciem: Contrastive instruction evaluation method for better instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
  19. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
  20. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. arXiv preprint arXiv:2311.17911, 2023.
  21. Vcoder: Versatile vision encoders for multimodal large language models. arXiv preprint arXiv:2312.14233, 2023.
  22. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 2023.
  23. Hallucination augmented contrastive learning for multimodal large language model. arXiv preprint arXiv:2312.06968, 2023.
  24. Faithscore: Evaluating hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477, 2023.
  25. Volcano: Mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362, 2023.
  26. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922, 2023.
  27. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
  28. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
  29. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023.
  30. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  31. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
  32. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  33. Visual instruction tuning. In NeurIPS, 2023.
  34. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023.
  35. Vision-and-language pretrained models: A survey. In IJCAI, 2022.
  36. Negative object presence evaluation (nope) to measure object hallucination in vision-language models. arXiv preprint arXiv:2310.05338, 2023.
  37. Neural baby talk. In CVPR, 2018.
  38. Evaluation and mitigation of agnosia in multimodal large language models. arXiv preprint arXiv:2309.04041, 2023.
  39. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In EMNLP, 2023.
  40. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  41. Learning transferable visual models from natural language supervision. In ICML, 2021.
  42. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  43. Object hallucination in image captioning. In EMNLP, 2018.
  44. Learning to summarize with human feedback. In NeurIPS, volume 33, 2020.
  45. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
  46. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  47. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  48. Vigc: Visual instruction generation and correction. arXiv preprint arXiv:2308.12714, 2023.
  49. An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
  50. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023.
  51. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
  52. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023.
  53. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
  54. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv preprint arXiv:2312.00849, 2023.
  55. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779, 2023.
  56. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023.
  57. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
  58. Enhancing the spatial awareness capability of multi-modal large language model. arXiv preprint arXiv:2310.20357, 2023.
  59. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.
  60. Analyzing and mitigating object hallucination in large vision-language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
  61. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (9)
  1. Hanchao Liu (8 papers)
  2. Wenyuan Xue (4 papers)
  3. Yifei Chen (58 papers)
  4. Dapeng Chen (33 papers)
  5. Xiutian Zhao (6 papers)
  6. Ke Wang (529 papers)
  7. Liping Hou (4 papers)
  8. Rongjun Li (7 papers)
  9. Wei Peng (164 papers)