
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding (2402.18476v1)

Published 28 Feb 2024 in cs.CV

Abstract: Despite rapid development and widespread application, Large Vision-Language Models (LVLMs) confront a serious challenge: they are prone to generating hallucinations. An over-reliance on linguistic priors has been identified as a key factor leading to these hallucinations. In this paper, we propose to alleviate this problem by introducing a novel image-biased decoding (IBD) technique. Our method derives the next-token probability distribution by contrasting predictions from a conventional LVLM with those of an image-biased LVLM, thereby amplifying correct information highly correlated with image content while mitigating hallucinatory errors caused by excessive dependence on text. We further conduct a comprehensive statistical analysis to validate the reliability of our method, and design an adaptive adjustment strategy to achieve robust and flexible handling under varying conditions. Experimental results across multiple evaluation metrics verify that our method, despite requiring no additional training data and only a minimal increase in model parameters, can significantly reduce hallucinations in LVLMs and enhance the truthfulness of the generated responses.
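The core mechanism described in the abstract is a contrastive decoding step: at each generation step, the logits of an image-biased LVLM are played off against those of the conventional LVLM so that tokens supported by the image are boosted and tokens driven mainly by linguistic priors are suppressed. The sketch below illustrates one such step under simple assumptions; the function name, the fixed weight `alpha`, and the linear combination rule are illustrative choices, not the paper's exact formulation (the paper additionally uses an adaptive adjustment strategy to set the contrast strength per token).

```python
import torch
import torch.nn.functional as F

def image_biased_decode_step(base_logits: torch.Tensor,
                             image_biased_logits: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    """One contrastive decoding step in the spirit of IBD.

    base_logits:         next-token logits from the conventional LVLM
    image_biased_logits: next-token logits from a variant of the model
                         whose predictions are biased toward the image
    alpha:               contrast strength (illustrative assumption; the
                         paper adjusts this adaptively per step)
    """
    # Amplify what the image-biased model adds over the base model,
    # down-weighting tokens favored mostly by linguistic priors.
    contrastive_logits = (1 + alpha) * image_biased_logits - alpha * base_logits
    return F.softmax(contrastive_logits, dim=-1)

# Toy usage with random logits over a small vocabulary.
vocab_size = 10
base = torch.randn(vocab_size)
biased = torch.randn(vocab_size)
probs = image_biased_decode_step(base, biased)
next_token = torch.argmax(probs).item()
```

When `alpha = 0` this reduces to decoding from the image-biased model alone; larger values penalize tokens that the conventional model already predicts without strong image support.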

Authors (6)
  1. Lanyun Zhu
  2. Deyi Ji
  3. Tianrun Chen
  4. Peng Xu
  5. Jieping Ye
  6. Jun Liu
Citations (28)