Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding (2403.18715v2)

Published 27 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules. ICD contrasts distributions from standard and instruction disturbance, thereby increasing alignment uncertainty and effectively subtracting hallucinated concepts from the original distribution. Through comprehensive experiments on discriminative benchmarks (POPE and MME) and a generative benchmark (LLaVA-Bench), we demonstrate that ICD significantly mitigates both object-level and attribute-level hallucinations. Moreover, our method not only addresses hallucinations but also significantly enhances the general perception and recognition capabilities of LVLMs.
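
The abstract describes ICD as contrasting the next-token distribution obtained with the standard instruction against the one obtained with a disturbance instruction, so that concepts amplified by the disturbance are subtracted out. The snippet below is a minimal sketch of that contrast, assuming the common contrastive-decoding form (1 + alpha) * standard_logits - alpha * disturbed_logits used in related work; the function name, the toy vocabulary, and the alpha value are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def icd_contrast(logits_standard, logits_disturbed, alpha=1.0):
    """Contrast next-token logits from the standard instruction with those from a
    disturbance instruction. Uses the common contrastive-decoding arithmetic
    (1 + alpha) * standard - alpha * disturbed; alpha here is an illustrative choice."""
    contrasted = (1.0 + alpha) * logits_standard - alpha * logits_disturbed
    # Convert the contrasted logits back into a probability distribution (softmax).
    exp = np.exp(contrasted - contrasted.max())
    return exp / exp.sum()

# Toy 4-token vocabulary; the disturbed run inflates a hallucinated token (index 2).
standard = np.array([2.0, 1.0, 1.5, 0.5])
disturbed = np.array([2.0, 1.0, 3.0, 0.5])
print(icd_contrast(standard, disturbed))  # probability mass shifts away from index 2
```

Tokens whose likelihood rises mainly under the disturbance instruction are penalized, which is the intuition behind "subtracting hallucinated concepts from the original distribution."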

Authors (4)
  1. Xintong Wang
  2. Jingheng Pan
  3. Liang Ding
  4. Chris Biemann