
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation (2408.00555v1)

Published 1 Aug 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Despite the remarkable ability of large vision-language models (LVLMs) in image comprehension, these models frequently generate plausible yet factually incorrect responses, a phenomenon known as hallucination. Recently, augmenting large language models (LLMs) by retrieving information from external knowledge resources has proven to be a promising solution for mitigating hallucinations. However, retrieval augmentation in LVLMs significantly lags behind the widespread application of LVLMs. Moreover, when transferred to LVLMs, retrieval augmentation sometimes even exacerbates the model's degree of hallucination. Motivated by this research gap and counter-intuitive phenomenon, we introduce a novel framework, the Active Retrieval-Augmented large vision-language model (ARA), specifically designed to address hallucinations by incorporating three critical dimensions: (i) dissecting the retrieval targets based on the inherent hierarchical structures of images; (ii) pinpointing the most effective retrieval methods and filtering out the reliable retrieval results; (iii) timing the retrieval process to coincide with episodes of low certainty, while circumventing unnecessary retrieval during periods of high certainty. To assess the capability of our proposed ARA model in reducing hallucination, we employ three widely used LVLMs (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks. Our empirical observations suggest that by utilizing fitting retrieval mechanisms and timing the retrieval judiciously, we can effectively mitigate the hallucination problem. We hope that this study can provide deeper insights into how to adapt retrieval augmentation to LVLMs for reducing hallucinations with more effective retrieval and minimal retrieval occurrences.

Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation

The paper "Alleviating Hallucination in Large Vision-LLMs with Active Retrieval Augmentation" addresses a prevalent issue within Large Vision-LLMs (LVLMs): hallucination. Hallucination occurs when these models generate semantically plausible but factually incorrect responses to queries about images. The research introduces an innovative framework called the Active Retrieval-Augmented large vision-LLM (ARA), designed to mitigate hallucinations by augmenting LVLMs with external knowledge through smart retrieval methodologies.

Key Contributions

The paper makes three main contributions that enhance LVLMs' ability to produce accurate and reliable outputs:

  1. Coarse-to-Fine Retrieval Framework: Exploiting the hierarchical structure of images, the paper proposes a dual-phase retrieval mechanism that employs both coarse-grained (full-image) and fine-grained (object-specific) retrieval. This ensures that both the broader context and the specific details of an image are factored into the retrieval operations.
  2. Optimal Retrieval Methods: The paper systematically evaluates and identifies the most effective retrieval techniques. This involves comparing various methods of utilizing visual and textual information from the database to augment the model's internal knowledge. The paper also explores the integration of different embedding models to optimize retrieval.
  3. Active Triggering Based on Query Difficulty: To avoid unnecessary retrievals and improve efficiency, the ARA model selectively triggers retrieval based on the estimated difficulty of each query, determined with a mutual-information-based difficulty metric that assesses how strongly the model's output depends on the visual input (see the sketch below).
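
To make the triggering idea concrete, the following Python sketch shows one plausible implementation under stated assumptions: it compares the model's next-token distribution with and without the image and retrieves only when the output barely depends on the visual input. The `model` interface, the use of KL divergence as a dependence proxy, and the threshold value are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def should_trigger_retrieval(model, image_inputs, text_inputs, threshold=0.5):
        """Decide whether to retrieve, using a proxy for how strongly the
        next-token distribution depends on the visual input.

        `model`, `image_inputs`, `text_inputs`, and `threshold` are
        hypothetical placeholders, not the paper's actual interfaces/values.
        """
        with torch.no_grad():
            # Next-token logits with and without image conditioning.
            logits_with_image = model(image=image_inputs, text=text_inputs).logits[:, -1, :]
            logits_text_only = model(image=None, text=text_inputs).logits[:, -1, :]

        p_with = F.softmax(logits_with_image, dim=-1)
        p_without = F.softmax(logits_text_only, dim=-1)

        # KL divergence between the two distributions: a small value means the
        # output barely changes when the image is removed, i.e. the answer is
        # not well grounded in the image.
        dependence = F.kl_div(p_without.log(), p_with, reduction="batchmean").item()

        # Low dependence -> treat the query as difficult and trigger retrieval.
        return dependence < threshold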

Methodology

Input and Decoding in LVLMs

The input to LVLMs includes both visual and textual data, which are processed into a sequence of tokens representing the image and the accompanying text. A retrieval-augmented decoding approach is then employed, in which external information is integrated into generation according to the retrieval outcome.
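
A minimal sketch of such retrieval-augmented decoding, assuming a hypothetical `lvlm.generate` interface and a `retrieve_fn` that returns text snippets, is to fold the retrieved evidence into the prompt before generation; the paper's actual integration into decoding may differ.

    def retrieval_augmented_generate(lvlm, image, question, retrieve_fn, max_new_tokens=64):
        """Fold retrieved snippets into the prompt before generation.
        `lvlm.generate` and `retrieve_fn` are assumed interfaces."""
        snippets = retrieve_fn(image, question)          # external knowledge (may be empty)
        context = "\n".join(f"- {s}" for s in snippets)  # serialize retrieved evidence

        if snippets:
            prompt = f"Reference information:\n{context}\n\nQuestion: {question}\nAnswer:"
        else:
            prompt = f"Question: {question}\nAnswer:"
        return lvlm.generate(image=image, prompt=prompt, max_new_tokens=max_new_tokens)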

Coarse-to-Fine Hierarchical Retrieval

  • Coarse-Grained Retrieval: Utilizes the CLIP model to extract embeddings from both input images and a large database to retrieve visually similar images. This retrieved information provides broad contextual data that enhances the model's understanding.
  • Fine-Grained Retrieval: Targets specific objects depicted within the images. By using an LLM to extract key entities from the query and grounding techniques to locate these entities within the image, the fine-grained retrieval homes in on the most relevant regions of the image, offering detailed and focused external knowledge.
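
The sketch below illustrates how this coarse-to-fine flow could be wired together in Python. It uses the CLIP checkpoint and API from Hugging Face transformers for the coarse stage; the database (`db_embeddings`, `db_captions`), the entity extractor, and the grounding/cropping step are hypothetical stand-ins rather than the paper's components.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def coarse_retrieve(image, db_embeddings, db_captions, top_k=3):
        """Coarse-grained retrieval: embed the full image with CLIP and return
        the captions of the most visually similar database images."""
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            query = clip.get_image_features(**inputs)
        query = query / query.norm(dim=-1, keepdim=True)
        sims = query @ db_embeddings.T            # cosine similarity (db rows pre-normalized)
        top = sims.squeeze(0).topk(top_k).indices
        return [db_captions[i] for i in top]

    def fine_retrieve(image, question, extract_entities, ground_and_crop,
                      db_embeddings, db_captions):
        """Fine-grained retrieval: extract key entities from the query, crop the
        grounded regions, and retrieve for each crop. `extract_entities` and
        `ground_and_crop` stand in for an LLM extractor and an open-set grounder."""
        results = []
        for entity in extract_entities(question):
            for crop in ground_and_crop(image, entity):
                results.extend(coarse_retrieve(crop, db_embeddings, db_captions, top_k=1))
        return results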

Advanced Reranking and Joint Decoding Mechanisms

Once the retrieval processes are complete, the results are refined through a reranking strategy that orders the retrieved data by semantic similarity to the query, ensuring that only the most relevant information is used. A joint decoding step then integrates these refined results into generation to produce more accurate responses.
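
One simple way to realize such a reranking step, assuming a sentence-transformers text encoder (the paper's actual reranker may differ), is to score each retrieved caption against the query and keep only the top-scoring entries:

    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

    def rerank(query, retrieved_captions, keep=3):
        """Order retrieved captions by semantic similarity to the query and
        keep the best `keep` entries."""
        q_emb = encoder.encode(query, convert_to_tensor=True)
        c_emb = encoder.encode(retrieved_captions, convert_to_tensor=True)
        scores = util.cos_sim(q_emb, c_emb).squeeze(0)
        order = scores.argsort(descending=True)[:keep]
        return [retrieved_captions[i] for i in order]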

Experimental Evaluations

The efficacy of the ARA model was validated through extensive empirical evaluations on three LVLMs (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks: POPE, MME, MMStar, and MMBench. The results show substantial improvements in mitigating hallucinations:

  • POPE Benchmark: Demonstrated significant enhancements in accuracy, precision, and recall across random, popular, and adversarial settings.
  • MME Subset: Notable improvements in both object-level and attribute-level hallucination scores, with consistent performance increases across all subsets.
  • MMStar and MMBench: Showed superior performance, particularly in subsets requiring augmented reasoning capabilities, reinforcing the effectiveness of the retrieval augmentation.

Discussion and Future Directions

The findings of this paper pave the way for further exploration and refinement in the domain of retrieval-augmented generation. The promising results indicate that smart retrieval mechanisms tailored to trigger only when necessary can substantially improve the reliability and accuracy of LVLMs. Future developments could focus on enhancing the granularity of retrieval methods, refining confidence metrics for triggering retrieval, and expanding the external knowledge database to cover a wider array of topics.

Conclusion

The paper "Alleviating Hallucination in Large Vision-LLMs with Active Retrieval Augmentation" presents significant advancements in addressing the hallucination problem in LVLMs through an innovative retrieval-augmented framework. By incorporating hierarchical retrieval processes, optimizing retrieval methods, and intelligently triggering retrievals based on query difficulty, the ARA model effectively enhances the accuracy and reliability of LVLM outputs, demonstrating promising improvements across multiple benchmarks and model architectures. This research sets a notable precedent for the continued refinement and application of retrieval-augmented generation in vision-LLMs.

Authors (5)
  1. Xiaoye Qu (62 papers)
  2. Qiyuan Chen (22 papers)
  3. Wei Wei (424 papers)
  4. Jishuo Sun (1 paper)
  5. Jianfeng Dong (38 papers)
Citations (8)