HallusionBench: Diagnosing LVLM Hallucinations
- HallusionBench is a diagnostic benchmark that evaluates LVLM failure modes by distinguishing language hallucinations from visual illusions.
- It employs a structured design with 346 images, 1,129 yes/no questions, and control groups to pinpoint logical inconsistencies and bias in model responses.
- Empirical results reveal low question-pair accuracy and notable biases in state-of-the-art models, highlighting the need for balanced tuning and uncertainty modeling.
HallusionBench is a comprehensive diagnostic benchmark designed to evaluate the entangled failure modes of hallucination and visual illusion in large vision-language models (LVLMs). Unlike previous vision–language benchmarks, which often focus on a limited set of hallucination types (e.g., object existence mistakes) or lack structural controls for distinguishing failure modes, HallusionBench systematically probes the tension between a model's language priors and its ability to ground answers in the visual modality. This diagnostic suite consists of 346 core images (with additional edited variants), 1,129 binary (yes/no) expert-crafted questions, and a question-pair structure enabling fine-grained analysis of logical consistency, bias, and modality-specific failure attribution. The benchmark targets state-of-the-art LVLMs, including GPT-4V(ision), Gemini Pro Vision, Claude 3, and multiple open-source models, revealing persistent and challenging hallucination phenomena that limit the practical reliability of LVLMs (Guan et al., 2023).
1. Motivation and Diagnostic Scope
The primary motivation behind HallusionBench is the observation that strong LVLMs display two intertwined failure patterns:
- Language hallucination: The model produces answers based on language priors (knowledge learned from pretraining data), disregarding the visual input or even directly contradicting it.
- Visual illusion: The model confidently misinterprets visual data, generating output that appears well-grounded but is factually incorrect due to errors in image understanding.
Earlier benchmarks either target object hallucination in isolation or use designs (e.g., standard VQA or MCQ) that confound model bias (yes/no response tendencies) with true visual grounding ability. HallusionBench distinguishes itself by introducing control-group structures—editing key visual features or presenting no-image situations—to enable diagnostic comparison of model responses. The benchmark aims to quantify not just accuracy, but also logical consistency, yes/no bias, and the propensity for language-driven vs. vision-driven errors.
2. Benchmark Design and Composition
HallusionBench comprises:
- Visual corpus: 346 distinct images drawn from the Web, supplemented by 181 human-edited variants (approximately 45% of images are edited for control).
- Question structure: 1,129 binary (yes/no) questions, averaging 3.26 per image, partitioned nearly evenly into two types:
- Visual-Dependent (VD): 591 questions that strictly require image content for correct inference.
- Visual-Supplement (VS): 538 questions that are answerable using language/background knowledge but are also tested under visual grounding (e.g., text-based comparative facts verified with a chart/map).
- Control-group pairing: Each “question root” yields a baseline instance and an edited instance (e.g., real image vs. visually manipulated version). This structure makes it possible to attribute observed model errors to specific failure types: an unchanged wrong answer on the edited visual (language hallucination), a changed but still wrong answer (visual illusion), or mixed/uncertain. A sketch of this pairing structure appears below.
The dataset also includes a “no-image” placeholder for selected VS questions, to test whether the model appropriately expresses uncertainty when no visual input is provided.
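For concreteness, the pairing described above can be pictured as one record per question root that bundles the baseline and edited instances. The field names below are illustrative assumptions, not the benchmark's released schema, but they capture the information each control pair must carry:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ControlPair:
    """Hypothetical record for one HallusionBench question root (not the released schema)."""
    question: str                  # binary yes/no question shared across the pair
    question_type: str             # 'VD' (visual-dependent) or 'VS' (visual-supplement)
    original_image: Optional[str]  # path to the unedited web image; None for no-image VS probes
    edited_image: Optional[str]    # human-edited variant with a key visual feature changed
    gold_original: str             # ground-truth answer ('yes'/'no') for the original instance
    gold_edited: Optional[str]     # ground-truth answer for the edited instance, if present
```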
3. Evaluation Methodology and Metrics
Evaluation in HallusionBench is anchored in several diagnostic metrics (an illustrative computation follows this list):
- All-accuracy (aAcc):
$\mathrm{aAcc} =\frac{1}{|\mathcal{V}|}\sum_{(I,q)\in\mathcal{V}} b_{M}(I,q)$
where $b_{M}(I,q)$ is 1 if model $M$ produces the correct answer for image $I$ and question $q$, else 0, and $\mathcal{V}$ is the set of all evaluated (image, question) pairs.
- Figure-accuracy (fAcc):
$\mathrm{fAcc} =\frac{1}{|\mathbb{I}|}\sum_{I\in\mathbb{I}} \mathbbm{1}\left( \bigwedge_{q:(I,q)\in\mathcal{V}} b_{M}(I,q) \right)$
assessing the model's consistency over all questions for each image.
- Question-pair accuracy (qAcc):
$\mathrm{qAcc} =\frac1{|\mathbb{Q}|}\sum_{q\in\mathbb{Q}} \mathbbm{1}\left( \bigwedge_{I\in\mathbb{I}_q} b_{M}(I,q) \right)$
measuring the fraction of correct, logically consistent predictions across baseline and edited (control) versions of each question.
- Bias diagnostics:
- Pct. Diff (yes-percentage difference): difference between the proportion of “yes” answers predicted by the model and the proportion in the ground truth. A value of 0 is ideal.
- False-positive ratio (FP ratio): among incorrect answers, the proportion answered “yes.” A value near 0.5 indicates no systematic bias.
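The sketch below shows how these metrics could be computed from per-question model outputs. The record layout and function name are assumptions for illustration; the official scoring script in the HallusionBench repository may differ in detail.

```python
from collections import defaultdict

def hallusionbench_metrics(records):
    """Illustrative (not official) computation of HallusionBench-style metrics.

    Each record is a dict with keys:
      'question_root': id shared by the baseline/edited variants of a question,
      'image_id':      id of the (possibly edited) image,
      'pred', 'gold':  model answer and ground truth, each 'yes' or 'no'.
    """
    correct = [r['pred'] == r['gold'] for r in records]
    a_acc = sum(correct) / len(records)                      # all-accuracy (aAcc)

    by_image, by_question = defaultdict(list), defaultdict(list)
    for r, c in zip(records, correct):
        by_image[r['image_id']].append(c)
        by_question[r['question_root']].append(c)

    # fAcc: an image counts only if every question about it is answered correctly.
    f_acc = sum(all(v) for v in by_image.values()) / len(by_image)
    # qAcc: a question root counts only if every paired instance is answered correctly.
    q_acc = sum(all(v) for v in by_question.values()) / len(by_question)

    # Bias diagnostics.
    pred_yes = sum(r['pred'] == 'yes' for r in records) / len(records)
    gold_yes = sum(r['gold'] == 'yes' for r in records) / len(records)
    pct_diff = pred_yes - gold_yes                           # 0 is ideal

    wrong = [r for r, c in zip(records, correct) if not c]
    fp_ratio = sum(r['pred'] == 'yes' for r in wrong) / max(len(wrong), 1)

    return {'aAcc': a_acc, 'fAcc': f_acc, 'qAcc': q_acc,
            'PctDiff': pct_diff, 'FPRatio': fp_ratio}
```

Because one wrong member of a pair zeroes out the whole question root, qAcc is strictly harder than aAcc, which explains the large gap between the two columns in the results table below.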
Model performance is then categorized along these axes, and a four-way decision tree compares answer changes between paired questions/visuals to attribute errors to language hallucination or visual illusion.
4. Empirical Results and Failure Analysis
Evaluations on HallusionBench include 15 models, among them GPT-4V, Claude 3, Gemini Pro Vision, LLaVA-1.5 (13B), and other open-source architectures. A random-chance baseline is also computed for reference.
Key results are as follows:
| Model | qAcc (%) | fAcc (%) | aAcc (%) |
|---|---|---|---|
| GPT-4V | 31.42 | 44.22 | 67.58 |
| Claude 3 | 21.76 | 28.61 | 56.86 |
| LLaVA-1.5 | 9.45 | 25.43 | 47.12 |
| BLIP2-T5 | 15.16 | 20.52 | 48.09 |
| mPLUG_Owl-v2 | 13.85 | 19.94 | 47.30 |
| Random chance | 15.60 | 18.21 | 45.96 |
Notable findings:
- Overall performance remains low: Even the strongest model (GPT-4V) achieves only 31.42% question-pair accuracy, while open-source models typically remain below 16%—close to random.
- Bias and error modes:
- Certain models (e.g., LLaVA-1.5, Qwen-VL, Flamingo) exhibit a pronounced “yes” bias (Pct. Diff up to 0.17, FP ratios as high as 0.79), which inflates accuracy on questions whose ground-truth answer is “yes” while hurting it on those where “no” is correct.
- Failure mode attribution (exemplified using GPT-4V):
- Language hallucination: 22.19%
- Visual illusion: 45.66%
- Mixed/uncertain: 32.14%
- Per-case studies:
- Optical-illusion manipulation, video-sequence reversal, and chart-reading all expose cases where models ignore visual evidence or remain anchored in language priors despite manipulated input.
5. Structural Innovations and Analytical Power
The distinguishing design principle of HallusionBench is its use of control groups and diagnostic question pairs. By introducing visually edited versions of the same scene and observing response changes, the benchmark separates the following (an attribution sketch follows this list):
- Hallucination due to language priors (answer does not change post-edit).
- Illusion due to visual misprocessing (answer changes in response to edit, but remains incorrect).
- Consistency and logical integrity across related queries.
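A minimal sketch of this attribution logic, assuming access to the model's answers and the gold labels for both members of a control pair. The branching below paraphrases the description above and is not the benchmark's released decision tree:

```python
def attribute_failure(pred_orig, pred_edit, gold_orig, gold_edit):
    """Classify a control pair's outcome; all arguments are 'yes'/'no' strings."""
    orig_ok = pred_orig == gold_orig
    edit_ok = pred_edit == gold_edit
    if orig_ok and edit_ok:
        return "correct"
    if pred_orig == pred_edit and not edit_ok:
        # The answer did not move when the visual evidence changed:
        # the language prior is overriding the image.
        return "language hallucination"
    if pred_orig != pred_edit:
        # The answer reacted to the edit but the pair is still not fully correct:
        # the visual content was misread.
        return "visual illusion"
    return "mixed/uncertain"
```

Applied across all pairs, a routine of this kind yields the language-hallucination / visual-illusion / mixed breakdown reported in Section 4.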
This architecture also allows analysis of logical consistency (via figure-accuracy), yes/no risk profiles, and detection of spurious invariance to critical visual cues.
6. Implications and Recommendations
The empirical analysis of HallusionBench yields a series of technical recommendations:
- Balanced tuning: Incorporate human-edited visuals and “fooler” images during training, as in LRV-Instruction, to harden models against language-prior contamination.
- Joint calibration: Introduce a “vision fidelity loss” that explicitly penalizes unchanged answers across edited control pairs (a toy sketch follows this list).
- Specialist reasoning modules: Develop geometric and temporal sub-networks for tasks demanding numerical or sequence understanding.
- Uncertainty modeling: Allow the model to respond “uncertain” or abstain when visual and language priors are in conflict, instead of defaulting to overconfident answers.
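As a rough illustration of the joint-calibration idea, the toy loss below (PyTorch, entirely hypothetical and not taken from the paper) adds a hinge penalty that fires when the gold answers of a control pair differ but the model's yes/no distributions stay nearly identical:

```python
import torch.nn.functional as F

def vision_fidelity_loss(logits_orig, logits_edit, gold_orig, gold_edit, margin=0.5):
    """Toy objective for edited control pairs (hypothetical sketch).

    logits_*: (batch, 2) yes/no logits for the original and edited image;
    gold_*:   (batch,) integer labels (0 = no, 1 = yes).
    """
    # Standard supervision on both members of the pair.
    ce = F.cross_entropy(logits_orig, gold_orig) + F.cross_entropy(logits_edit, gold_edit)

    # Hinge penalty: when the gold answers differ, the two predicted
    # distributions should be at least `margin` apart in L1 distance.
    p_orig = F.softmax(logits_orig, dim=-1)
    p_edit = F.softmax(logits_edit, dim=-1)
    dist = (p_orig - p_edit).abs().sum(dim=-1)
    labels_differ = (gold_orig != gold_edit).float()
    penalty = labels_differ * F.relu(margin - dist)

    return ce + penalty.mean()
```

The margin and the L1 distance are arbitrary choices here; the point is only that invariance to the edit can be penalized directly during fine-tuning rather than inferred post hoc.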
These insights suggest that genuine progress on hallucination resistance requires not only architectural or scaling improvements but systematic, structure-aware training and evaluation.
7. Accessibility and Future Development
HallusionBench is publicly released at https://github.com/tianyi-lab/HallusionBench, including all code, question-pair annotations, and supplementary diagnostic tools. The openly available resource supports several proposed extensions:
- Adaptation to other modalities: The control-group methodology can inform benchmarks in dialog, grounding, segmentation, and video understanding.
- Diagnostic protocol generalization: The analytical framework is adaptable for future benchmarks that seek to disentangle discrete error sources in different multimodal settings.
- Fine-grained failure mode localization: Rich per-question annotations enable the construction of adversarial datasets and algorithmic feedback loops for model improvement.
In summary, HallusionBench constitutes a rigorous benchmark for quantitatively dissecting the entangled failure mechanics of advanced LVLMs, setting a high standard for future vision–language diagnostic frameworks (Guan et al., 2023).