Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models (2410.03577v2)

Published 4 Oct 2024 in cs.CV

Abstract: Despite their impressive capabilities, multimodal LLMs (MLLMs) are prone to hallucinations, i.e., generated content that is nonsensical or unfaithful to the input sources. Unlike in LLMs, hallucinations in MLLMs often stem from the text decoder's sensitivity to visual tokens, leading to a phenomenon akin to "amnesia" about visual information. To address this issue, we propose MemVR, a novel decoding paradigm inspired by a common cognitive habit: when people forget an image they saw a moment before, they look at it again to answer factually. Following this principle, we treat visual tokens as supplementary evidence, re-injecting them into the MLLM through the Feed-Forward Network (FFN) as "key-value memory" at a middle trigger layer. This "look-twice" mechanism is triggered when the model exhibits high uncertainty during inference, effectively enhancing factual alignment. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination across various MLLMs and excels on general benchmarks without incurring additional time overhead. The implementation is available at https://github.com/1zhou-Wang/MemVR
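The abstract describes two moving parts: an uncertainty check that decides when to "look twice", and the re-injection of visual tokens into an FFN layer treated as key-value memory. The Python sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation (see the linked repository for that); the function names, the entropy threshold, and the injection scaling are all illustrative choices.

    # Minimal sketch of the "look-twice" idea; names and constants are assumptions.
    import torch
    import torch.nn.functional as F

    def ffn_with_visual_memory(hidden, w_up, w_down, visual_tokens=None, inject_ratio=0.1):
        """FFN viewed as key-value memory: rows of w_up act as keys,
        rows of w_down as values. When visual_tokens is given, projected
        visual features are appended as extra key-value pairs, re-exposing
        visual evidence to this layer."""
        keys, values = w_up, w_down                    # (d_ff, d_model) each
        if visual_tokens is not None:
            vk = visual_tokens                          # (n_vis, d_model) as keys
            vv = inject_ratio * visual_tokens           # scaled copies as values
            keys = torch.cat([keys, vk], dim=0)
            values = torch.cat([values, vv], dim=0)
        scores = F.gelu(hidden @ keys.T)                # memory addressing
        return scores @ values                          # weighted value readout

    def should_retrace(logits, entropy_threshold=2.5):
        """Trigger the second look only when next-token uncertainty is high."""
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        return entropy.mean().item() > entropy_threshold

    if __name__ == "__main__":
        d_model, d_ff, n_vis, vocab = 64, 256, 8, 32000
        hidden = torch.randn(1, 10, d_model)
        w_up, w_down = torch.randn(d_ff, d_model), torch.randn(d_ff, d_model)
        logits = torch.randn(1, 10, vocab)              # stand-in decoder logits
        vis = torch.randn(n_vis, d_model) if should_retrace(logits) else None
        out = ffn_with_visual_memory(hidden, w_up, w_down, visual_tokens=vis)
        print(out.shape)                                # torch.Size([1, 10, 64])

Reading w_up rows as keys and w_down rows as values follows the FFN-as-key-value-memory interpretation the abstract invokes; appending visual tokens at a single middle layer keeps the extra cost to one widened matrix product, which is consistent with the claim of no additional time overhead.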
