Analyzing and Mitigating Hallucinations in Large Vision-Language Models
The paper "Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence" investigates a pertinent issue in the functioning of Large Vision-LLMs (LVLMs)—hallucination. Hallucination occurs when the text generated by an LVLM does not accurately correspond to the visual content it is supposed to describe, significantly affecting the model's reliability and utility across diverse applications. The paper places a particular emphasis on unraveling the internal mechanics within LVLMs that lead to hallucination, rather than just treating the symptoms at the output stage.
Central to this paper is a new metric termed Vision-aware Head Divergence (VHD), which quantifies how sensitive each attention head within the multi-head attention module is to visual context. The authors find that while certain attention heads are highly attuned to visual information, the models often over-rely on language priors learned during training, which can drive hallucination.
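As a rough illustration of the idea, the sketch below scores each head by contrasting its output with and without the visual tokens. The tensor shapes, the L2 distance, and the function name `compute_vhd` are assumptions made for illustration and may differ from the paper's exact formulation.

```python
import torch

def compute_vhd(head_out_with_image: torch.Tensor,
                head_out_without_image: torch.Tensor) -> torch.Tensor:
    """Per-head divergence between attention-head outputs computed with and
    without the visual context (hypothetical formulation: L2 distance of each
    head's output for the current token, averaged over the batch).

    Args:
        head_out_with_image:    (batch, num_heads, head_dim) head outputs when
                                the image tokens are attended to normally.
        head_out_without_image: (batch, num_heads, head_dim) head outputs when
                                the image tokens are masked out.
    Returns:
        (num_heads,) tensor of divergence scores; larger = more vision-aware.
    """
    diff = head_out_with_image - head_out_without_image
    return diff.norm(dim=-1).mean(dim=0)  # average L2 distance per head
```

Heads with large scores change their output substantially when the image is removed, marking them as candidates for reinforcement.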
To address this, the authors propose Vision-aware Head Reinforcement (VHR), a training-free method that mitigates hallucination by strengthening the model's reliance on these vision-aware attention heads. This is achieved by selectively amplifying the contributions of such heads during text generation, thereby reducing the influence of language priors. Unlike existing approaches, VHR directly adjusts internal model components, avoiding the need for retraining or additional reference models.
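A minimal sketch of this reinforcement step follows, assuming it amounts to scaling the outputs of the top-ranked heads by a constant factor before the attention output projection; the scaling rule and the `top_k` and `scale` values are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def reinforce_vision_aware_heads(head_outputs: torch.Tensor,
                                 vhd_scores: torch.Tensor,
                                 top_k: int = 8,
                                 scale: float = 1.5) -> torch.Tensor:
    """Amplify the top-k heads ranked by VHD before they are merged by the
    attention output projection (hypothetical scaling scheme).

    Args:
        head_outputs: (batch, num_heads, seq_len, head_dim) per-head outputs.
        vhd_scores:   (num_heads,) vision-aware head divergence scores.
        top_k:        number of vision-aware heads to reinforce.
        scale:        amplification factor applied to the selected heads.
    """
    gains = torch.ones_like(vhd_scores)
    top_heads = vhd_scores.topk(top_k).indices
    gains[top_heads] = scale
    # Broadcast the per-head gain over batch, sequence, and feature dims.
    return head_outputs * gains.view(1, -1, 1, 1)
```

Because the adjustment is a simple per-head rescaling applied at inference time, it requires no gradient updates and adds essentially no latency.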
The paper provides a series of experiments to substantiate the effectiveness of VHR. Notably, VHR outperforms state-of-the-art decoding strategies on well-recognized hallucination benchmarks such as CHAIR and POPE. For instance, on the CHAIR benchmark with the LLaVA-1.5 model, VHR reduces the CHAIR$_S$ and CHAIR$_I$ scores by up to 16.36 and 4.61 points, respectively, compared to baseline methods. Furthermore, VHR adds negligible computational overhead, offering a practical and scalable alternative to more resource-intensive strategies.
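For context, CHAIR$_S$ is the fraction of captions that mention at least one object not present in the image, and CHAIR$_I$ is the fraction of mentioned object instances that are hallucinated. The sketch below computes both in a simplified form that treats each caption's object mentions as a set (the standard metric counts repeated mentions, so exact values may differ).

```python
def chair_scores(captions_objects, ground_truth_objects):
    """Compute simplified CHAIR_S and CHAIR_I scores.

    Args:
        captions_objects:     list of sets of objects mentioned in each caption.
        ground_truth_objects: list of sets of objects actually in each image.
    Returns:
        (chair_s, chair_i) tuple of hallucination rates in [0, 1].
    """
    hallucinated_mentions = 0
    total_mentions = 0
    hallucinated_captions = 0
    for mentioned, present in zip(captions_objects, ground_truth_objects):
        hallucinated = mentioned - present        # objects not in the image
        hallucinated_mentions += len(hallucinated)
        total_mentions += len(mentioned)
        hallucinated_captions += bool(hallucinated)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    return chair_s, chair_i
```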
The implications of this work are both practical and theoretical. Practically, VHR offers a pathway to improve the accuracy and reliability of LVLMs in real-world scenarios, potentially leading to more robust applications in areas requiring precise multimodal reasoning. Theoretically, the work highlights the value of dissecting and manipulating internal attention mechanisms to address model biases, enriching our understanding of how these models behave beyond surface-level output evaluation.
Future work inspired by this research could integrate VHR-like strategies into more complex multimodal tasks or extend them to architectures beyond current LVLMs. Further inquiry into the behavior of transformer attention heads could also yield insights for designing models that balance input modalities more effectively, optimizing performance while minimizing erroneous outputs.