Analyzing Behavioral Hallucination Mitigation in Multimodal LLMs
The paper "Mitigating Behavioral Hallucination in Multimodal LLMs for Sequential Images" provides a comprehensive paper focused on addressing the challenge of hallucinations in Multimodal LLMs (MLLMs), which deal with tasks involving both textual and visual data. These models, while achieving advanced performance in tasks such as visual question answering and image captioning, are prone to generating 'hallucinations'—output content that is semantically coherent but not aligned with the visual input.
Key Contributions
- Behavioral vs. Objective Hallucinations: The research distinguishes objective hallucinations, which misrepresent the objects present in images, from behavioral hallucinations, in which the model describes actions or interactions that the sequential images do not actually support. Previous research has largely focused on the former; this paper's noteworthy contribution is to address the latter.
- Sequence Hallucination Eradication (SHE) Framework: The authors propose SHE, a two-stage framework for detecting and mitigating behavioral hallucinations: it first detects hallucinations through visual-textual alignment, then mitigates them via orthogonal projection within the joint embedding space (a minimal sketch follows this list).
- BEACH Metric: To quantitatively evaluate the severity of behavioral hallucinations, the paper introduces a new metric, BEACH, which focuses specifically on incongruences between the actions and behaviors described and those actually shown in the image sequences (see the second sketch after this list).
- Empirical Validation: The effectiveness of the SHE framework is demonstrated across standard benchmarks, achieving a notable reduction in behavioral hallucination by over 10% while preserving the descriptive capability of the MLLMs.
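The two-stage description of SHE (detect via visual-textual alignment, then mitigate via orthogonal projection) can be sketched with toy NumPy embeddings standing in for a real MLLM's joint embedding space. The similarity threshold, the choice of hallucination direction, and the helper names below are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of SHE's detect-then-project idea, using toy NumPy
# embeddings in place of a real MLLM's joint embedding space. The threshold,
# embeddings, and hallucination direction are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def detect_hallucinated_behavior(text_emb, visual_emb, threshold=0.3):
    """Stage 1: flag a described behavior whose embedding aligns poorly
    with the pooled embedding of the image sequence."""
    return cosine(text_emb, visual_emb) < threshold

def mitigate_by_orthogonal_projection(text_emb, hallucination_dir):
    """Stage 2: remove the component of the text embedding lying along
    an estimated hallucination direction in the joint space."""
    d = hallucination_dir / np.linalg.norm(hallucination_dir)
    return text_emb - (text_emb @ d) * d

rng = np.random.default_rng(0)
visual_emb = rng.normal(size=512)    # stand-in for a pooled image-sequence embedding
behavior_emb = rng.normal(size=512)  # stand-in for a described behavior's embedding
if detect_hallucinated_behavior(behavior_emb, visual_emb):
    # Assumed hallucination direction: the gap between text and visual embeddings.
    cleaned = mitigate_by_orthogonal_projection(behavior_emb, behavior_emb - visual_emb)
```

In the actual framework the embeddings come from the MLLM itself; the toy vectors here only illustrate the projection geometry.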
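The summary does not reproduce BEACH's exact formula, but its intent (quantifying how many described behaviors lack grounding in the image sequence) can be illustrated with a toy ratio; the behavior lists and the simple fraction below are assumptions, not the paper's definition.

```python
# Toy illustration of a BEACH-style score: the fraction of described
# behaviors that are not grounded in the image sequence. This is an
# assumed simplification, not the paper's exact definition of BEACH.
def behavioral_hallucination_rate(described_behaviors, grounded_behaviors):
    described = set(described_behaviors)
    hallucinated = described - set(grounded_behaviors)
    return len(hallucinated) / max(len(described), 1)

# The model claims three behaviors, but only two appear across the frames.
score = behavioral_hallucination_rate(
    described_behaviors=["running", "jumping", "waving"],
    grounded_behaviors=["running", "jumping"],
)
print(round(score, 2))  # 0.33: higher values indicate more severe hallucination
```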
Causal Analysis
The paper explores the causes of behavioral hallucinations in MLLMs, emphasizing two primary factors: the prior-driven effect and the snowball effect.
- Prior-Driven Effect: Biases inherited from the training data skew the model's interpretations and lead to hallucinations. The paper introduces Co-Occurrence Scores to measure how frequently hallucinated behaviors co-occur with non-hallucinated behaviors or with hallucinated objects, indicating such bias (a toy computation appears after this list).
- Snowball Effect: Early mistakes in interpreting a sequence propagate, triggering a chain reaction of further errors. Experiments show that longer sequences and higher sampling rates lead to higher hallucination rates (a simple compounding-error model is also sketched below).
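To make the co-occurrence idea concrete, a rough score can count how often a hallucinated concept appears alongside correctly grounded concepts in caption-style data standing in for the model's training prior; a high value suggests the behavior was inserted because of learned priors. The corpus, function names, and scoring rule below are hypothetical, not the paper's Co-Occurrence Score.

```python
from collections import Counter
from itertools import combinations

# Hypothetical caption corpus standing in for the model's training prior.
captions = [
    {"dog", "running", "park"},
    {"dog", "running", "ball"},
    {"dog", "barking", "park"},
    {"cat", "sleeping", "sofa"},
]

# Count how often each pair of concepts appears together in a caption.
pair_counts = Counter()
for caption in captions:
    for a, b in combinations(sorted(caption), 2):
        pair_counts[(a, b)] += 1

def co_occurrence_score(hallucinated, grounded_concepts):
    """Rough score: how strongly the hallucinated concept co-occurs with
    the concepts that were correctly grounded in the sequence."""
    total = sum(
        pair_counts[tuple(sorted((hallucinated, g)))] for g in grounded_concepts
    )
    return total / max(len(grounded_concepts), 1)

# If the model hallucinates "running" for a sequence that only shows a dog
# in a park, the prior-driven effect predicts a high score.
print(co_occurrence_score("running", {"dog", "park"}))  # 1.5 in this toy corpus
```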
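The snowball effect can likewise be made concrete with a small compounding-error model: each frame description errs with a base probability, and that probability rises once an earlier error has entered the context. The probabilities and trial counts below are purely illustrative assumptions, not measurements from the paper.

```python
import random

def per_frame_error_rate(num_frames, base_p=0.05, snowball_p=0.20, trials=20_000):
    """Monte Carlo estimate of the average per-frame error rate when an
    early error raises the error probability for all later frames."""
    total_errors = 0
    for _ in range(trials):
        errored = False
        for _ in range(num_frames):
            p = snowball_p if errored else base_p
            if random.random() < p:
                errored = True
                total_errors += 1
    return total_errors / (trials * num_frames)

for n in (4, 8, 16):
    print(n, round(per_frame_error_rate(n), 3))
# The per-frame error rate climbs with sequence length, mirroring the
# qualitative trend the paper reports for longer sequences.
```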
Practical and Theoretical Implications
Practically, this research provides an important step toward making MLLMs more reliable in fields requiring rigorous accuracy, such as medical imaging and autonomous driving. Theoretically, the introduction of SHE and the novel evaluation metric BEACH may serve as foundational tools for further research in enhancing multimodal AI systems' interpretive robustness.
Future Outlook
Future research might apply SHE or similar methodologies to other modalities, such as audio-visual data. The findings also need to be validated on more diverse datasets beyond the current benchmarks to establish the framework's generality. Open avenues include further refining the adaptive temporal windowing and exploring its applicability to real-time systems.
Overall, the paper's meticulous approach to understanding and mitigating behavioral hallucinations offers a structured pathway to enhance the reliability and utility of MLLMs in real-world applications, positioning this research as a pivotal contribution to the ongoing development of multimodal artificial intelligence.