Hallucination in Multimodal LLMs

Updated 2 November 2025
  • Hallucination in MLLMs is defined as generating outputs inconsistent with the visual evidence; it is markedly exacerbated by real-world perturbations such as noise or cropping.
  • The Hallu-PI benchmark quantifies hallucination using metrics such as CHAIR, Cover, and Acc+, systematically evaluating existence, attribute, and relation errors.
  • Mitigation strategies like Perturbed-Reminder and Perturbed-ICL help reduce hallucinations, yet achieving robust performance in safety-critical applications remains a challenge.

Hallucination in Multimodal LLMs (MLLMs) refers to the phenomenon in which a model generates output—such as a caption or answer—that is not consistent with the given visual input (image) or the actual scene, often fabricating, misattributing, or misunderstanding entities, attributes, or relations present in the visual data. This challenge is accentuated in real-world scenarios where visual inputs are perturbed by noise, cropping, blurring, or other distortions, and it represents a critical barrier to deploying MLLMs in safety-critical applications.

1. Nature and Taxonomy of Hallucination in MLLMs

MLLM hallucination is formally defined as the generation of content inconsistent with the provided visual evidence. Hallu-PI establishes three principal types of hallucination:

  • Existence Hallucination: Assertion of objects not present in the image.
  • Attribute Hallucination: Misrepresentation of object attributes (e.g., color, number).
  • Relation Hallucination: Incorrect claims about inter-object relationships (e.g., spatial layout or semantic roles).

These are further disambiguated using fine-grained fields, including explicit number, color, and a "Hal-object" indicator to specify the scope of possible hallucinated claims.
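
For concreteness, an annotation record combining these fields might resemble the sketch below; the field names and structure are illustrative assumptions for exposition, not Hallu-PI's exact released schema.

```python
# Hypothetical Hallu-PI-style annotation for one perturbed image.
# Field names are illustrative assumptions, not the benchmark's exact schema.
annotation = {
    "image_id": "concat_0042",
    "perturbation": "image_concatenation",
    "objects": ["dog", "frisbee"],                            # existence: objects actually present
    "attributes": {"dog": {"color": "brown", "number": 1}},   # attribute fields (color, number)
    "relations": [("dog", "left of", "frisbee")],             # inter-object relations
    "hal_objects": ["cat", "ball"],                           # "Hal-object": plausible but absent objects
}
```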

In addition to these output-level dimensions, Hallu-PI adds an input-level taxonomy via perturbation scenarios, mapping how real-world alterations of the input (noise, blur, weather, digital manipulation, concatenation, cropping, or misleading prompts) affect the likelihood and type of hallucination.

2. The Hallu-PI Benchmark: Design and Structure

Hallu-PI is the first systematic benchmark for hallucination assessment in MLLMs given perturbed inputs. It consists of:

  • 1,260 Images across 11 object types, each perturbed according to seven scenarios:
    • Image-level: Noise (Gaussian, shot), blur (defocus, motion), weather (fog, snow), digital manipulations (contrast, compression, pixelation), image concatenation, image cropping.
    • Prompt-level: Deliberately misleading or misaligned question prompts.
  • Each instance is paired with detailed annotations covering existence, attribute, and relation hallucinations, as well as auxiliary fields.
  • Dual Tasks: Both discriminative (yes/no queries, e.g., "Is there a dog in the image?") and generative (free-form description or answer, e.g., "Describe objects in the top-left quadrant"), enabling comprehensive stress-testing.
  • Prompt Templates: Standardized and perturbation-aware, for task and scenario consistency.

This design enables the measurement of hallucination for both simple and complex visual-language understanding under realistic, noisy, and adversarial conditions.
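
For illustration, the image-level perturbations can be approximated with standard image-processing operations. The following is a minimal sketch using NumPy and Pillow; the parameter values (noise strength, crop ratio) are assumptions, not the benchmark's exact settings.

```python
import numpy as np
from PIL import Image

def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Add zero-mean Gaussian noise (illustrative noise perturbation)."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def center_crop(img: Image.Image, ratio: float = 0.6) -> Image.Image:
    """Keep only the central region (illustrative cropping perturbation)."""
    w, h = img.size
    cw, ch = int(w * ratio), int(h * ratio)
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch))

def concatenate(imgs: list[Image.Image]) -> Image.Image:
    """Place several images side by side (illustrative concatenation perturbation)."""
    h = min(im.height for im in imgs)
    resized = [im.resize((int(im.width * h / im.height), h)) for im in imgs]
    canvas = Image.new("RGB", (sum(im.width for im in resized), h))
    x = 0
    for im in resized:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas
```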

3. Evaluation Metrics and Experimental Findings

Hallu-PI Metrics (a computation sketch follows the list below):

  • CHAIR (Hallucination Rate, generative):

    \text{CHAIR}(\text{Res}) = 1 - \frac{|\text{Predicted Objects} \cap \text{Annotated Objects}|}{|\text{Predicted Objects}|}

    Fraction of hallucinated (non-existent) objects in output.

  • Cover (Correctness, generative):

    \text{Cover}(\text{Res}) = \frac{|\text{Predicted Objects} \cap \text{Annotated Objects}|}{|\text{Annotated Objects}|}

    Fraction of annotated ground-truth objects recovered.

  • Hal: Proportion of responses with at least one hallucination.
  • Cog: Proportion of hallucinated objects that coincide with the annotated "potential hallucination targets".
  • Acc+, Precision, Recall, F1 (discriminative): Standard classification metrics, with Acc+ representing enhanced accuracy (both Yes/No forms must be correct, preventing bias from always guessing a default).
  • PI-Score: Unified score for both tasks,

    \text{PI-Score} = \frac{1}{2}\left[(1 - \text{Hal}) + \text{Accuracy}^{+}\right]
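
A minimal sketch of how these metrics can be computed from sets of predicted and annotated objects is given below; the function names and data layout are assumptions for illustration, not the benchmark's released evaluation code.

```python
# Illustrative metric computation; not Hallu-PI's official evaluation script.

def chair(predicted: set[str], annotated: set[str]) -> float:
    """Fraction of predicted objects absent from the annotation (hallucination rate)."""
    if not predicted:
        return 0.0
    return 1.0 - len(predicted & annotated) / len(predicted)

def cover(predicted: set[str], annotated: set[str]) -> float:
    """Fraction of annotated ground-truth objects that the model actually mentions."""
    if not annotated:
        return 0.0
    return len(predicted & annotated) / len(annotated)

def hal(responses: list[tuple[set[str], set[str]]]) -> float:
    """Proportion of responses containing at least one hallucinated object."""
    return sum(1 for pred, ann in responses if pred - ann) / len(responses)

def acc_plus(pairs: list[tuple[bool, bool]]) -> float:
    """Acc+: a sample counts as correct only if both the 'Yes' and 'No'
    formulations of the query are answered correctly."""
    return sum(1 for yes_ok, no_ok in pairs if yes_ok and no_ok) / len(pairs)

def pi_score(hal_rate: float, acc_plus_val: float) -> float:
    """Unified PI-Score = 0.5 * [(1 - Hal) + Acc+]."""
    return 0.5 * ((1.0 - hal_rate) + acc_plus_val)

# Example: prediction {"dog", "cat"} against annotation {"dog"}
# chair(...) == 0.5 ("cat" is hallucinated); cover(...) == 1.0
```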

Empirical Findings:

Experiments on 12 mainstream MLLMs (including GPT-4V, Gemini-Pro Vision) demonstrate:

  • Significant Increase in Hallucination on Perturbed Inputs: Hallucinations rise markedly in perturbed vs. unperturbed data (e.g., LLaVA-1.5's CHAIR jumps from 68.5 to 92.3 under image concatenation).
  • Perturbation-Type Sensitivity: Image cropping and misleading prompts cause the most catastrophic performance drops—MLLMs often hallucinate missing alphabetic elements in cropped text or follow misleading prompts into generating content for non-existent objects.
  • Model Biases: Even advanced models like Gemini and GPT-4V exhibit scenario-specific weaknesses, with clear biases toward hallucinated existence, attribute, or relation errors, especially after real-world perturbations.
  • Attribute-Specific Failures: "Number" hallucinations (counting errors) and "Relation" hallucinations (e.g., spatial arrangement confusion) are particularly sensitive to input distortion, with models frequently failing to maintain consistent object counts or spatial logic under perturbation.

Quantitative Example:

Metric                              Pre-Perturbation    Post-Perturbation
Acc+ (top models, cropping)         43.4                ≤ 30.0
CHAIR (LLaVA-1.5, concatenation)    68.5                92.3

4. Baseline Mitigation Strategies: Perturbed-Reminder and Perturbed-ICL

Two baseline methods are proposed and evaluated in Hallu-PI:

  • Perturbed-Reminder: Augments the text prompt with explicit reminders about potential perturbations (e.g., "The image shown may be cropped or noisy. Please answer accurately."). This context cue reduces hallucination by refocusing model attention on actual visual evidence and away from priors or assumptions.
  • Perturbed-ICL (In-Context Learning): Prepends the input with demonstration examples of perturbed images and correct answers, enabling the model to learn, by induction, how to handle perturbations (both baselines are sketched below).
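
As a rough illustration of how these two baselines could be assembled, the sketch below constructs the augmented prompts. The reminder wording and demonstration format are assumed for exposition, not Hallu-PI's exact templates; in practice the in-context demonstrations include actual perturbed images passed through the model's image interface rather than text placeholders.

```python
# Illustrative prompt construction for the two baselines; wording is assumed.

REMINDER = ("Note: the image may have been perturbed (e.g., cropped, noisy, "
            "blurred, or concatenated). Describe only what is actually visible.")

def perturbed_reminder_prompt(question: str) -> str:
    """Perturbed-Reminder: prepend an explicit perturbation warning to the query."""
    return f"{REMINDER}\n{question}"

def perturbed_icl_prompt(question: str,
                         demos: list[tuple[str, str, str]]) -> str:
    """Perturbed-ICL: prepend demonstrations of (perturbed image, question,
    correct answer) before the actual query. Images are shown here as text
    placeholders; a real MLLM would receive them via its image interface."""
    parts = []
    for image_desc, demo_q, demo_a in demos:
        parts.append(f"[Perturbed image: {image_desc}]\nQ: {demo_q}\nA: {demo_a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```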

Experimental analysis demonstrates that both approaches reduce hallucination rates (Hal decreases; Acc+ increases), but neither fully solves the problem, especially under severe or adversarial perturbations—underscoring the fundamental limitations of current MLLM architectures and the need for deeper robustness.

5. Implications for Robustness, Model Design, and Deployment

Hallu-PI exposes critical vulnerabilities and failure modes:

  • Systematic Real-World Fragility: Robustness to input perturbation is a major unsolved challenge for current MLLMs, with real-world corruptions seriously undermining factual reliability.
  • Type- and Scenario-Sensitive Bias: Distinct model biases emerge; some models or training regimes are more sensitive to particular hallucination axes or perturbation types, suggesting that evaluation and alignment need to be perturbation-aware.
  • Incomplete Defenses: Prompt-based and ICL-based defenses offer measurable but only partial mitigation; architectural and pretraining advances are required for substantive robustness.
  • Research & Safety Implications: Effective hallucination mitigation under input noise is essential for safety-critical deployments (e.g., medical imaging, autonomous driving), and evaluation protocols should stress models with perturbed, not just clean, data.

6. Future Directions and Benchmark Adoption

Key avenues recommended by Hallu-PI for future research include:

  • Robust MLLM Architectures: New training objectives, architectural strategies, or multimodal pretraining incorporating realistic perturbations.
  • Generalized Hallucination Detection/Correction: Incorporation of detection metrics or correction modules that can adapt across perturbation types.
  • Comprehensive Evaluation Pipelines: Routine integration of perturbed-scenario benchmarks (such as Hallu-PI) into all stages of evaluation and deployment.
  • Domain-Specific Validation: Use of such benchmarks in domains where input corruption or misleading user prompts are commonplace and high risk.

Hallu-PI, as the first systematic perturbation-aware hallucination testbed, anchors a new evaluation standard for the vision-language alignment community.


Summary Table: Hallu-PI Key Properties and Results

Aspect                      Details
# Images / Object Types     1,260 / 11
Perturbation Types          7 (noise, blur, weather, digital, concatenation, cropping, misleading prompts)
Annotation Granularity      Existence, Attribute, Relation
Task Types                  Discriminative (Yes/No), Generative
Metrics                     CHAIR, Cover, Hal, Cog, Acc+, F1, PI-Score
Key Result                  Hallucination rises sharply under perturbation; cropping and misleading prompts are most severe
Baselines                   Perturbed-Reminder, Perturbed-ICL
Benchmarked Models          12 (e.g., GPT-4V, Gemini-Pro Vision)

Hallu-PI thus establishes a foundational resource for the systematic study and improvement of hallucination robustness in MLLMs operating on noisy, incomplete, or adversarially perturbed visual inputs (Ding et al., 2 Aug 2024).
