Hallucination in Multimodal LLMs

Updated 2 November 2025
  • Hallucination in MLLMs is defined as generating outputs inconsistent with the visual evidence; it is markedly exacerbated by real-world perturbations such as noise or cropping.
  • The Hallu-PI benchmark quantifies hallucination using metrics such as CHAIR, Cover, and Acc+, systematically evaluating existence, attribute, and relation errors.
  • Mitigation strategies like Perturbed-Reminder and Perturbed-ICL help reduce hallucinations, yet achieving robust performance in safety-critical applications remains a challenge.

Hallucination in Multimodal LLMs (MLLMs) refers to the phenomenon in which a model generates output—such as a caption or answer—that is not consistent with the given visual input (image) or the actual scene, often fabricating, misattributing, or misunderstanding entities, attributes, or relations present in the visual data. This challenge is accentuated in real-world scenarios where visual inputs are perturbed by noise, cropping, blurring, or other distortions, and it represents a critical barrier to deploying MLLMs in safety-critical applications.

1. Nature and Taxonomy of Hallucination in MLLMs

MLLM hallucination is formally defined as the generation of content inconsistent with the provided visual evidence. Hallu-PI establishes three principal types of hallucination:

  • Existence Hallucination: Assertion of objects not present in the image.
  • Attribute Hallucination: Misrepresentation of object attributes (e.g., color, number).
  • Relation Hallucination: Incorrect claims about inter-object relationships (e.g., spatial layout or semantic roles).

These are further disambiguated using fine-grained fields, including explicit number, color, and a "Hal-object" indicator to specify the scope of possible hallucinated claims.
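
For concreteness, an annotation record combining these fields might resemble the sketch below; the field names and structure are illustrative assumptions for exposition, not Hallu-PI's exact released schema.

```python
# Hypothetical Hallu-PI-style annotation for one perturbed image.
# Field names are illustrative assumptions, not the benchmark's exact schema.
annotation = {
    "image_id": "concat_0042",
    "perturbation": "image_concatenation",
    "objects": ["dog", "frisbee"],                            # existence: objects actually present
    "attributes": {"dog": {"color": "brown", "number": 1}},   # attribute fields (color, number)
    "relations": [("dog", "left of", "frisbee")],             # inter-object relations
    "hal_objects": ["cat", "ball"],                           # "Hal-object": plausible but absent objects
}
```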

In addition to these output-level dimensions, Hallu-PI adds an input-level taxonomy via perturbation scenarios, mapping how real-world alterations of the input (noise, blur, weather, digital manipulation, concatenation, cropping, or misleading prompts) affect the likelihood and type of hallucination.

2. The Hallu-PI Benchmark: Design and Structure

Hallu-PI is the first systematic benchmark for hallucination assessment in MLLMs given perturbed inputs. It consists of:

  • 1,260 Images across 11 object types, each perturbed according to seven scenarios:
    • Image-level: Noise (Gaussian, shot), blur (defocus, motion), weather (fog, snow), digital manipulations (contrast, compression, pixelation), image concatenation, image cropping.
    • Prompt-level: Deliberately misleading or misaligned question prompts.
  • Each instance is paired with detailed annotations covering existence, attribute, and relation hallucinations, as well as auxiliary fields.
  • Dual Tasks: Both discriminative (yes/no queries, e.g., "Is there a dog in the image?") and generative (free-form description or answer, e.g., "Describe objects in the top-left quadrant"), enabling comprehensive stress-testing.
  • Prompt Templates: Standardized and perturbation-aware, for task and scenario consistency.

This design enables the measurement of hallucination for both simple and complex visual-language understanding under realistic, noisy, and adversarial conditions.
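
For illustration, the image-level perturbations can be approximated with standard image-processing operations. The following is a minimal sketch using NumPy and Pillow; the parameter values (noise strength, crop ratio) are assumptions, not the benchmark's exact settings.

```python
import numpy as np
from PIL import Image

def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Add zero-mean Gaussian noise (illustrative noise perturbation)."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def center_crop(img: Image.Image, ratio: float = 0.6) -> Image.Image:
    """Keep only the central region (illustrative cropping perturbation)."""
    w, h = img.size
    cw, ch = int(w * ratio), int(h * ratio)
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch))

def concatenate(imgs: list[Image.Image]) -> Image.Image:
    """Place several images side by side (illustrative concatenation perturbation)."""
    h = min(im.height for im in imgs)
    resized = [im.resize((int(im.width * h / im.height), h)) for im in imgs]
    canvas = Image.new("RGB", (sum(im.width for im in resized), h))
    x = 0
    for im in resized:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas
```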

3. Evaluation Metrics and Experimental Findings

Hallu-PI Metrics (a computation sketch follows the list below):

  • CHAIR (Hallucination Rate, generative):

    \text{CHAIR}(\text{Res}) = 1 - \frac{|\text{Predicted Objects} \cap \text{Annotated Objects}|}{|\text{Predicted Objects}|}

    Fraction of hallucinated (non-existent) objects in output.

  • Cover (Correctness, generative):

    \text{Cover}(\text{Res}) = \frac{|\text{Predicted Objects} \cap \text{Annotated Objects}|}{|\text{Annotated Objects}|}

    Fraction of annotated ground-truth objects recovered.

  • Hal: Proportion of responses with at least one hallucination.
  • Cog: Proportion of hallucinated objects that coincide with the annotated "potential hallucination targets".
  • Acc+, Precision, Recall, F1 (discriminative): Standard classification metrics, with Acc+ representing enhanced accuracy (both Yes/No forms must be correct, preventing bias from always guessing a default).
  • PI-Score: Unified score for both tasks,

    \text{PI-Score} = \frac{1}{2}\left[(1 - \text{Hal}) + \text{Accuracy}^{+}\right]
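
A minimal sketch of how these metrics can be computed from sets of predicted and annotated objects is given below; the function names and data layout are assumptions for illustration, not the benchmark's released evaluation code.

```python
# Illustrative metric computation; not Hallu-PI's official evaluation script.

def chair(predicted: set[str], annotated: set[str]) -> float:
    """Fraction of predicted objects absent from the annotation (hallucination rate)."""
    if not predicted:
        return 0.0
    return 1.0 - len(predicted & annotated) / len(predicted)

def cover(predicted: set[str], annotated: set[str]) -> float:
    """Fraction of annotated ground-truth objects that the model actually mentions."""
    if not annotated:
        return 0.0
    return len(predicted & annotated) / len(annotated)

def hal(responses: list[tuple[set[str], set[str]]]) -> float:
    """Proportion of responses containing at least one hallucinated object."""
    return sum(1 for pred, ann in responses if pred - ann) / len(responses)

def acc_plus(pairs: list[tuple[bool, bool]]) -> float:
    """Acc+: a sample counts as correct only if both the 'Yes' and 'No'
    formulations of the query are answered correctly."""
    return sum(1 for yes_ok, no_ok in pairs if yes_ok and no_ok) / len(pairs)

def pi_score(hal_rate: float, acc_plus_val: float) -> float:
    """Unified PI-Score = 0.5 * [(1 - Hal) + Acc+]."""
    return 0.5 * ((1.0 - hal_rate) + acc_plus_val)

# Example: prediction {"dog", "cat"} against annotation {"dog"}
# chair(...) == 0.5 ("cat" is hallucinated); cover(...) == 1.0
```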

Empirical Findings:

Experiments on 12 mainstream MLLMs (including GPT-4V, Gemini-Pro Vision) demonstrate:

  • Significant Increase in Hallucination on Perturbed Inputs: Hallucinations rise markedly in perturbed vs. unperturbed data (e.g., LLaVA-1.5's CHAIR jumps from 68.5 to 92.3 under image concatenation).
  • Perturbation-Type Sensitivity: Image cropping and misleading prompts cause the most catastrophic performance drops—MLLMs often hallucinate missing alphabetic elements in cropped text or follow misleading prompts into generating content for non-existent objects.
  • Model Biases: Even advanced models like Gemini and GPT-4V exhibit scenario-specific weaknesses, with clear biases toward hallucinated existence, attribute, or relation errors, especially after real-world perturbations.
  • Attribute-Specific Failures: "Number" hallucinations (counting errors) and "Relation" hallucinations (e.g., spatial arrangement confusion) are particularly sensitive to input distortion, with models frequently failing to maintain consistent object counts or spatial logic under perturbation.

Quantitative Example:

Metric                              Pre-Perturbation    Post-Perturbation
Acc+ (top models, cropping)         43.4                ≤ 30.0
CHAIR (LLaVA-1.5, concatenation)    68.5                92.3

4. Baseline Mitigation Strategies: Perturbed-Reminder and Perturbed-ICL

Two baseline methods are proposed and evaluated in Hallu-PI:

  • Perturbed-Reminder: Augments the text prompt with explicit reminders about potential perturbations (e.g., "The image shown may be cropped or noisy. Please answer accurately."). This context cue reduces hallucination by refocusing model attention on actual visual evidence and away from priors or assumptions.
  • Perturbed-ICL (In-Context Learning): Prepends the input with demonstration examples of perturbed images and correct answers, enabling the model to learn, by induction, how to handle perturbations (both baselines are sketched below).
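
As a rough illustration of how these two baselines could be assembled, the sketch below constructs the augmented prompts. The reminder wording and demonstration format are assumed for exposition, not Hallu-PI's exact templates; in practice the in-context demonstrations include actual perturbed images passed through the model's image interface rather than text placeholders.

```python
# Illustrative prompt construction for the two baselines; wording is assumed.

REMINDER = ("Note: the image may have been perturbed (e.g., cropped, noisy, "
            "blurred, or concatenated). Describe only what is actually visible.")

def perturbed_reminder_prompt(question: str) -> str:
    """Perturbed-Reminder: prepend an explicit perturbation warning to the query."""
    return f"{REMINDER}\n{question}"

def perturbed_icl_prompt(question: str,
                         demos: list[tuple[str, str, str]]) -> str:
    """Perturbed-ICL: prepend demonstrations of (perturbed image, question,
    correct answer) before the actual query. Images are shown here as text
    placeholders; a real MLLM would receive them via its image interface."""
    parts = []
    for image_desc, demo_q, demo_a in demos:
        parts.append(f"[Perturbed image: {image_desc}]\nQ: {demo_q}\nA: {demo_a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```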

Experimental analysis demonstrates that both approaches reduce hallucination rates (Hal decreases; Acc+ increases), but neither fully solves the problem, especially under severe or adversarial perturbations—underscoring the fundamental limitations of current MLLM architectures and the need for deeper robustness.

5. Implications for Robustness, Model Design, and Deployment

Hallu-PI exposes critical vulnerabilities and failure modes:

  • Systematic Real-World Fragility: Robustness to input perturbation is a major unsolved challenge for current MLLMs, with real-world corruptions seriously undermining factual reliability.
  • Type- and Scenario-Sensitive Bias: Distinct model biases emerge; some models or training regimes are more sensitive to particular hallucination axes or perturbation types, suggesting that evaluation and alignment need to be perturbation-aware.
  • Incomplete Defenses: Prompt-based and ICL-based defenses offer measurable but only partial mitigation; architectural and pretraining advances are required for substantive robustness.
  • Research & Safety Implications: Effective hallucination mitigation under input noise is essential for safety-critical deployments (e.g., medical imaging, autonomous driving), and evaluation protocols should stress models with perturbed, not just clean, data.

6. Future Directions and Benchmark Adoption

Key avenues recommended by Hallu-PI for future research include:

  • Robust MLLM Architectures: New training objectives, architectural strategies, or multimodal pretraining incorporating realistic perturbations.
  • Generalized Hallucination Detection/Correction: Incorporation of detection metrics or correction modules that can adapt across perturbation types.
  • Comprehensive Evaluation Pipelines: Routine integration of perturbed-scenario benchmarks (such as Hallu-PI) into all stages of evaluation and deployment.
  • Domain-Specific Validation: Use of such benchmarks in domains where input corruption or misleading user prompts are commonplace and high risk.

Hallu-PI, as the first systematic perturbation-aware hallucination testbed, anchors a new evaluation standard for the vision-language alignment community.


Summary Table: Hallu-PI Key Properties and Results

Aspect                      Details
# Images / Object Types     1,260 / 11
Perturbation Types          7 (noise, blur, weather, digital, concatenation, cropping, misleading prompts)
Annotation Granularity      Existence, Attribute, Relation
Task Types                  Discriminative (Yes/No), Generative
Metrics                     CHAIR, Cover, Hal, Cog, Acc+, F1, PI-Score
Key Result                  Hallucination rises sharply under perturbation; cropping and misleading prompts are most severe
Baselines                   Perturbed-Reminder, Perturbed-ICL
Benchmarked Models          12 (e.g., GPT-4V, Gemini-Pro Vision)

Hallu-PI thus establishes a foundational resource for the systematic study and improvement of hallucination robustness in MLLMs operating on noisy, incomplete, or adversarially perturbed visual inputs (Ding et al., 2 Aug 2024).
