A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models (2405.09589v4)

Published 15 May 2024 in cs.LG, cs.AI, cs.CL, cs.CV, cs.SD, and eess.AS

Abstract: The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.
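Among the detection strategies such surveys review, one widely used black-box idea is sampling-based self-consistency, popularized by SelfCheckGPT: re-sample the model several times on the same prompt and check whether the original answer is supported by the samples, since content the model cannot reproduce is more likely hallucinated. The sketch below is illustrative only and is not taken from the paper; `generate` is a hypothetical stand-in for any text-generation API, and the token-overlap similarity is a deliberate simplification of the NLI- or QA-based scoring used in practice.

```python
# Minimal sketch of sampling-based self-consistency hallucination
# detection (in the spirit of SelfCheckGPT). `generate` is a
# hypothetical stand-in for any text-generation call; the token-overlap
# similarity is a deliberate simplification.

from typing import Callable, List


def consistency_score(
    prompt: str,
    answer: str,
    generate: Callable[[str], str],
    n_samples: int = 5,
) -> float:
    """Return a rough support score in [0, 1] for `answer`.

    Intuition: if the model hallucinated `answer`, independent
    re-samples of the same prompt are unlikely to reproduce its
    content, so token overlap with the samples stays low.
    """
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0

    overlaps: List[float] = []
    for _ in range(n_samples):
        sample_tokens = set(generate(prompt).lower().split())
        overlaps.append(len(answer_tokens & sample_tokens) / len(answer_tokens))

    return sum(overlaps) / len(overlaps)


# Usage: flag a possible hallucination when the score falls below a
# threshold tuned on a validation set, e.g.:
#   if consistency_score(prompt, answer, generate) < 0.5:
#       print("low consistency -- possible hallucination")
```

In practice, stronger variants replace the token overlap with an entailment model or question-answering checks over each sampled response, but the sampling-and-compare structure is the same.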

Authors (6)
  1. Pranab Sahoo (5 papers)
  2. Prabhash Meharia (1 paper)
  3. Akash Ghosh (14 papers)
  4. Sriparna Saha (48 papers)
  5. Vinija Jain (43 papers)
  6. Aman Chadha (110 papers)
Citations (4)
