Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective (2402.14545v2)
Abstract: Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, wherein they may generate content that is not present in the visual inputs. In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, leading to outputs that continue beyond its visual perception limits. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model has an inherent potential to make proper EOS decisions based on its visual perception and thereby avoid overly lengthy outputs. To exploit this potential, we explore two methods for mitigating multimodal hallucinations: a training objective that enables the model to reduce hallucinations by learning from regular instruction data, and a data filtering strategy that prevents harmful training data from exacerbating model hallucinations. Both methods substantially reduce hallucinations in LMMs without requiring any additional data or knowledge.
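The abstract does not spell out the training objective, so the sketch below shows one plausible instantiation of an EOS-aware selective loss, assuming the idea is to skip the language-modeling loss at positions where the model already ranks EOS above the gold content token, so that overly long reference captions do not suppress its inclination to stop. The function name `selective_lm_loss` and the exact masking rule are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits: torch.Tensor,
                      labels: torch.Tensor,
                      eos_token_id: int,
                      ignore_index: int = -100) -> torch.Tensor:
    """Next-token cross-entropy that drops positions where the model already
    prefers EOS over the gold (non-EOS) token, so overly detailed references
    do not penalize the model's tendency to terminate generation.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len), already shifted.
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    safe_labels = labels.clamp_min(0)                                      # keep gather valid at padded positions
    gold_lp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)  # log p(gold token)
    eos_lp = log_probs[..., eos_token_id]                                  # log p(EOS)

    # Supervise a position unless (a) it is padding, or (b) the gold label is
    # a content token yet the model already assigns EOS a higher probability.
    is_pad = labels.eq(ignore_index)
    prefers_eos = (eos_lp > gold_lp) & labels.ne(eos_token_id)
    keep = ~(is_pad | prefers_eos)

    return -(gold_lp * keep).sum() / keep.sum().clamp_min(1)
```

Because the masked positions are simply dropped rather than relabeled, this stays a standard masked cross-entropy and could, under these assumptions, be swapped into an existing instruction-tuning loop without other changes.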