Multi-Modal Hallucination Control by Visual Information Grounding (2403.14003v1)

Published 20 Mar 2024 in cs.CV, cs.LG, and cs.CL

Abstract: Generative Vision-Language Models (VLMs) are prone to generating plausible-sounding textual answers that, however, are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination," and show that it stems from an excessive reliance on the language prior. In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without further training and with minimal computational overhead. If training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve accuracy on VQA benchmarks such as POPE by 21% and 24%.
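The decoding idea described in the abstract — favoring tokens whose probability rises when the image is present — can be sketched as a pointwise-mutual-information adjustment of the next-token logits. The function below is a minimal, illustrative sketch, not the paper's exact formulation: the name `mi_decode_step`, the geometric weight schedule, and the `decay` parameter are assumptions for illustration; it presumes you can obtain logits both with the visual prompt (`logits_cond`) and without it (`logits_uncond`).

```python
def mi_decode_step(logits_cond, logits_uncond, step,
                   base_weight=0.5, decay=0.9):
    """Greedy next-token choice that amplifies the visual prompt.

    logits_cond   -- next-token logits given image + text prompt
    logits_uncond -- next-token logits given the text prompt alone
    step          -- index of the token being generated

    The (cond - uncond) difference is a pointwise-mutual-information
    term: it is large for tokens made more likely by the image. The
    weight grows with the step index (illustrative schedule) to push
    back against the model's fading reliance on the visual prompt as
    generation proceeds.
    """
    weight = base_weight * (1.0 - decay ** step)
    scores = [c + weight * (c - u)
              for c, u in zip(logits_cond, logits_uncond)]
    return scores.index(max(scores))
```

For early tokens the weight is near zero, so decoding follows the conditioned model unchanged; for later tokens the amplification approaches `base_weight`, which can flip the choice toward a token the image supports even when the language prior prefers another.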

Authors (8)
  1. Alessandro Favero
  2. Luca Zancato
  3. Matthew Trager
  4. Siddharth Choudhary
  5. Pramuditha Perera
  6. Alessandro Achille
  7. Ashwin Swaminathan
  8. Stefano Soatto
Citations (34)