Exploring AUTOHALLUSION: Automated Benchmark Generation for LVLM Hallucinations
The paper "AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models" addresses a critical issue plaguing large vision-language models (LVLMs): their tendency to hallucinate. Hallucinations in this context are instances where an LVLM describes nonexistent objects or states incorrect details when processing multimodal inputs. The paper contributes a novel, scalable method for automatically generating benchmarks for hallucination detection, thereby circumventing the limitations of hand-crafted benchmarks.
Objectives and Methodology
The primary objective of this research is to develop a systematic, automated approach for generating diverse hallucination benchmarks for LVLMs. To this end, the authors devised AUTOHALLUSION, a tool that produces hallucination-inducing image-question pairs through automated image manipulation strategies. These strategies exploit the bias of LVLMs toward relying on language priors over the actual visual input.
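For concreteness, a single generated case might pair an edited image with existence and relation probes whose answers are known from the edit itself. The snippet below is purely illustrative; the field names, file names, and values are assumptions, not the paper's actual data format.

```python
# Hypothetical example of one hallucination-inducing case produced by such a
# pipeline; the schema and values are illustrative assumptions only.
example_case = {
    "strategy": "abnormal_object_insertion",
    "scene": "kitchen",
    "edited_image": "kitchen_with_fire_hydrant.png",  # out-of-context object pasted in
    "probes": [
        # Answers are known because the edit (object and placement) is controlled.
        {"question": "Is there a fire hydrant in this image?", "ground_truth": "yes"},
        {"question": "Is the fire hydrant on the countertop?", "ground_truth": "yes"},
    ],
}
```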
The methodology of AUTOHALLUSION involves three stages, sketched in code after the list:
- Scene and Object Manipulation: Leveraging three principal strategies (abnormal object insertion, paired object insertion, and correlated object removal) to induce hallucinations by creating conflicts between the visual input and the expectations driven by the LVLM's language priors.
- Automated Question Generation: Creating questions to probe different aspects of the image, focusing on the existence of objects and their spatial relations to reveal inconsistencies in LVLM responses.
- Hallucination Detection: Employing consistency checks and ground-truth comparisons to detect hallucinations from incorrect or inconsistent answers produced by LVLMs.
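To make the three stages concrete, here is a minimal, symbolic sketch of how they might be wired together. It assumes a generic `lvlm(scene, question)` callable and represents the scene as a set of object names rather than pixels; the real system edits actual images (e.g., pasting or inpainting objects) and uses richer checks, so everything below is an illustrative approximation rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# Symbolic sketch of the three-stage pipeline. The "scene" is a set of object
# names rather than an image so the control flow stays runnable; the actual
# tool performs these edits on real or synthesized images.
LVLM = Callable[[Set[str], str], str]  # (scene, question) -> free-form answer


@dataclass
class Probe:
    question: str
    ground_truth: str  # "yes" / "no", known because we control the edit


def manipulate_scene(scene: Set[str], strategy: str, obj: str) -> Set[str]:
    """Stage 1: create a conflict between the scene and language-prior expectations."""
    edited = set(scene)
    if strategy in ("abnormal_insertion", "paired_insertion"):
        edited.add(obj)       # e.g., paste a surfboard into an office scene
    elif strategy == "correlated_removal":
        edited.discard(obj)   # e.g., remove the monitor from a desk setup
    return edited


def generate_probes(obj: str, anchor: str, present: bool) -> List[Probe]:
    """Stage 2: existence and spatial-relation questions about the edited object.

    For insertions we assume the object was placed next to the anchor, so the
    relation answer is known; the real tool controls placement explicitly.
    """
    gt = "yes" if present else "no"
    return [
        Probe(f"Is there a {obj} in this image?", gt),
        Probe(f"Is the {obj} next to the {anchor}?", gt),
    ]


def detect_hallucination(lvlm: LVLM, scene: Set[str], probes: List[Probe]) -> bool:
    """Stage 3: ground-truth comparison plus a simple consistency check
    (the same question, asked twice, should receive a stable answer)."""
    for probe in probes:
        first = lvlm(scene, probe.question).strip().lower()
        second = lvlm(scene, probe.question).strip().lower()
        if not first.startswith(probe.ground_truth):  # contradicts what the edit guarantees
            return True
        if first[:3] != second[:3]:                   # unstable / self-inconsistent
            return True
    return False


# Toy usage: a stand-in "LVLM" that ignores the scene and answers purely from a
# language prior (offices contain monitors), so correlated removal fools it.
def prior_only_lvlm(scene: Set[str], question: str) -> str:
    return "yes" if "monitor" in question else "no"


office = {"desk", "chair", "monitor", "keyboard"}
edited = manipulate_scene(office, "correlated_removal", "monitor")
probes = generate_probes("monitor", "desk", present=False)
print(detect_hallucination(prior_only_lvlm, edited, probes))  # True: the prior overrides the edit
```

Read this way, the success rates reported in the next section amount to the fraction of generated cases for which a check like `detect_hallucination` fires.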
Key Findings
The research demonstrates high success rates in inducing hallucinations: 97.7% on synthetic datasets and 98.7% on real-world datasets. These results underscore the efficacy of the automated benchmark generation process. The paper also highlights several insights:
- LVLMs show more susceptibility to hallucinations when confronted with questions probing object insertions compared to object removals.
- GPT-4V exhibits greater robustness to hallucinations than the other tested models, suggesting a correlation between model size, training-data diversity, and hallucination resistance.
- Real-world data challenges LVLMs more than synthetic data, indicating difficulties in handling complex and varied real-world inputs.
Implications and Future Directions
This work provides a structured and scalable methodology for hallucination benchmark generation, crucial for developing more reliable LVLMs. The proposed AUTOHALLUSION highlights the need for LVLMs to better integrate visual and language modalities, reducing the over-reliance on language priors.
Practically, these benchmarks can serve as essential tools for evaluating and improving LVLM architectures and their training processes. Theoretically, this approach encourages deeper investigation into the cognitive mechanisms behind hallucinations and how models might better align with human-like perception and reasoning.
Future developments may focus on enhancing the diversity and complexity of the synthesized images, refining probing strategies, and exploring other aspects of hallucination, such as attribute-related tasks. Moreover, the benchmark datasets curated by AUTOHALLUSION could be expanded and used to systematically analyze LVLM performance across different domains and applications.
Overall, this paper offers a significant contribution to AI research, particularly in enhancing the reliability and accuracy of vision-language models by providing robust tools for understanding and mitigating hallucinations.