Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
The paper "Analyzing and Mitigating Object Hallucination in Large Vision-Language Models" addresses a significant challenge in the domain of vision-language models: the tendency of such models to generate hallucinated objects in their outputs. This issue, prevalent in Large Vision-Language Models (LVLMs), can lead to inaccuracies in tasks such as visual summarization and reasoning, with potential consequences across various fields, including robotics and medical imaging.
One of the main contributions of this research is the introduction of the LVLM Hallucination Revisor (LURE), a novel post-hoc algorithm designed to rectify object hallucinations in LVLMs. Grounded in statistical analysis, LURE targets three principal factors contributing to hallucinations: co-occurrence of objects, uncertainty during decoding, and object position within generated descriptions.
Key Findings and Methodology
The authors identify that hallucinations often stem from the spurious correlation between frequently co-occurring objects within the training data, a lack of certainty in the model's predictions, and the positional tendency for hallucinations to appear towards the latter part of the descriptions. These insights are supported by both empirical analysis and theoretical examination.
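To make these three factors concrete, the sketch below shows one way they might be quantified from training captions and decoding log-probabilities: a raw co-occurrence count over object pairs, an entropy-style uncertainty for a single decoding step, and a normalized position for an object mention. The function names and the specific statistics are illustrative assumptions, not the paper's exact formulations.

```python
from collections import Counter
from itertools import combinations
import math

def cooccurrence_counts(captions_objects):
    """Count how often object pairs appear together across training captions.

    captions_objects: list of sets of object names, one set per caption.
    (Illustrative raw count; the paper defines a normalized co-occurrence score.)
    """
    pair_counts = Counter()
    for objs in captions_objects:
        for a, b in combinations(sorted(objs), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def token_uncertainty(logprobs):
    """Entropy-style uncertainty for one decoding step, given token log-probs."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def relative_position(description_tokens, object_token_index):
    """Position of an object mention, normalized to [0, 1]; values near 1 mean
    the object appears toward the end of the description."""
    return object_token_index / max(len(description_tokens) - 1, 1)
```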
To address these challenges, LURE is developed to revise outputs from existing LVLMs without modifying their underlying architectures. The algorithm accomplishes this by constructing a dataset of hallucinated descriptions, generated by strategically modifying correct captions through prompts fed to an LLM such as GPT-3.5. This dataset is then used to train a hallucination revisor that integrates seamlessly with existing LVLMs and improves accuracy by reducing hallucinations.
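As a rough illustration of this data-construction step, the sketch below prompts GPT-3.5 to inject frequently co-occurring but absent objects into a correct caption, yielding a (hallucinated, correct) training pair for the revisor. The prompt wording and the helper name `make_hallucinated_caption` are hypothetical; the paper's actual instructions and object-selection rules differ in detail.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt wording for illustration only.
PROMPT_TEMPLATE = (
    "Here is an image caption: \"{caption}\".\n"
    "Rewrite it so that it additionally mentions these objects, which are NOT "
    "in the image: {objects}. Keep the style of the original caption."
)

def make_hallucinated_caption(caption, cooccurring_objects):
    """Ask GPT-3.5 to inject likely-but-absent objects into a correct caption,
    producing one (hallucinated, correct) training pair for the revisor."""
    prompt = PROMPT_TEMPLATE.format(
        caption=caption, objects=", ".join(cooccurring_objects)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage idea: pair each hallucinated caption with its original as (input, target)
# and fine-tune an LVLM-based revisor on these pairs.
```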
Experimental Results
LURE's performance was evaluated across six open-source LVLMs, demonstrating superior results over previous methods as measured by the two standard variants of the CHAIR hallucination metric, CHAIR_S and CHAIR_I. The assessment included comparisons against baseline strategies such as chain-of-thought prompting and guidance from teacher networks.
The results show a marked reduction in both instance-level hallucination (CHAIR_I) and sentence-level hallucination (CHAIR_S). These findings were further substantiated by evaluations from both GPT models and human annotators, indicating LURE's effectiveness in practical application scenarios.
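For reference, CHAIR compares the objects mentioned in generated captions against the objects annotated as present in the corresponding images. A minimal sketch of the two variants, assuming object extraction has already been performed, might look as follows.

```python
def chair_scores(generated_objects, ground_truth_objects):
    """Compute CHAIR_I (instance-level) and CHAIR_S (sentence-level).

    generated_objects:    list of lists, objects mentioned in each generated caption
    ground_truth_objects: list of sets, objects actually present in each image
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentioned, truth in zip(generated_objects, ground_truth_objects):
        hallucinated = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = captions_with_hallucination / max(len(generated_objects), 1)
    return chair_i, chair_s
```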
Implications and Future Directions
The implications of mitigating object hallucinations extend across various domains requiring accurate vision-language interfacing. For example, in robotics, reducing hallucinations can lead to more reliable task execution based on visual input. Furthermore, in fields like autonomous driving, medical diagnostics, and human-computer interaction, the ability to produce accurate textual descriptions of visual data is invaluable for both operational success and safety.
The approach introduced by LURE also opens avenues for further research into reducing hallucinations in other modalities, such as addressing similar challenges in purely text-based models or audio-visual systems. Additionally, exploring more advanced techniques for generating and refining hallucinated datasets may yield further improvements in hallucination mitigation.
Conclusion
Overall, this paper contributes significantly to the field of vision-language research by providing a robust framework to analyze and reduce object hallucinations in LVLMs. LURE not only highlights the importance of understanding the statistical underpinnings of hallucinations but also offers a practical solution to enhance the reliability of vision-language models in real-world applications. Future work could explore integrating these methodologies with other domains or applying more sophisticated machine learning techniques to further improve model accuracy and reliability.