
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio (2410.12787v1)

Published 16 Oct 2024 in cs.CV

Abstract: Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.

An Analysis of Hallucinations in Large Multimodal Models

The paper "The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio" presents an in-depth exploration of hallucination tendencies within large multimodal models (LMMs). As LMMs advance and increasingly incorporate diverse modalities such as language, visual, and audio data, their propensity to produce erroneous outputs—termed hallucinations—has become a prominent challenge. The paper systematically scrutinizes these hallucinations, identifying two principal contributors: reliance on unimodal priors and spurious inter-modality correlations.

The introduction of a new benchmark, The Curse of Multi-Modalities (CMM), forms the core contribution of this work. CMM comprehensively evaluates hallucinations across language, visual, and audio inputs through object-level and event-level probing. It casts hallucination detection as a binary classification task, using 1,200 curated video/audio samples accompanied by 2,400 corresponding probing questions. These questions are crafted to test both objects or events that are present in the given multimodal context and ones that are absent, where an affirmative answer constitutes a hallucination.
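
To make the binary-classification framing concrete, below is a minimal sketch of how a CMM-style probing item could be represented and scored. The schema, field names, yes/no parsing, and example question are illustrative assumptions, not the paper's released code.

```python
from dataclasses import dataclass

@dataclass
class ProbingItem:
    """One CMM-style probe (illustrative schema, not the paper's actual format)."""
    media_path: str      # path to the video/audio clip
    question: str        # e.g. "Is there a dog barking in the audio?"
    target_exists: bool  # ground truth: does the probed object/event occur?
    level: str           # "object" or "event"

def parse_yes_no(answer: str) -> bool | None:
    """Map a free-form model answer to a binary verdict."""
    text = answer.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None  # unparseable answers can simply be counted as errors

def is_correct(item: ProbingItem, model_answer: str) -> bool:
    verdict = parse_yes_no(model_answer)
    return verdict is not None and verdict == item.target_exists

# Example: a non-existence probe, where answering "yes" is a hallucination.
item = ProbingItem(
    media_path="clips/beach_0001.mp4",
    question="Is there a seagull calling in the audio?",
    target_exists=False,
    level="event",
)
print(is_correct(item, "Yes, I can hear a seagull."))  # False -> hallucination
```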

Key Findings

The paper reveals that LMMs often default to unimodal priors, which are the entrenched patterns learned from individual modalities. This tendency is highlighted through:

  • Language Dominance: LMMs dominated by their underlying LLMs can generate content that follows linguistic patterns, even when the visual or audio input signals contrary information.
  • Visual Dominance: An overemphasis on visual cues can lead LMMs to underweight linguistic or auditory input, for example hallucinating sounds suggested by the visuals.
  • Audio Dominance: Similarly, an overreliance on audio cues can induce hallucinations of visual content that is absent from the scene.

In addition to unimodal reliance, LMM hallucinations are frequently exacerbated by spurious inter-modality correlations: coincidental regularities within multimodal training datasets that models mistake for reliable evidence. For instance (a code sketch after this list shows one way such correlations can be turned into probes):

  • Visual-Language (VL) Correlations: Models may falsely generate visual content based on language signals, due to patterns of frequent co-occurrence in training datasets.
  • Audio-Language (AL) Correlations: Erroneous associations can cause models to hallucinate audio events from textual descriptions.
  • Visual-Audio-Language (VAL) Correlations: This more complex interaction pertains to simultaneous misalignments across all three modalities.
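
The sketch below shows one plausible way such co-occurrence-driven probes could be constructed: count how often concept pairs appear together in per-clip annotations, then for a given clip probe an absent concept that frequently co-occurs with a present one. The counting scheme and threshold are assumptions for illustration; the paper does not prescribe this exact procedure.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(annotations: list[set[str]]) -> Counter:
    """Count unordered concept pairs across per-clip annotation sets."""
    counts: Counter = Counter()
    for concepts in annotations:
        for a, b in combinations(sorted(concepts), 2):
            counts[(a, b)] += 1
    return counts

def spurious_probe_targets(present: set[str],
                           counts: Counter,
                           min_count: int = 50) -> set[str]:
    """Absent concepts that strongly co-occur with present ones are good
    candidates for non-existence probes (answering 'yes' = hallucination)."""
    candidates = set()
    for (a, b), n in counts.items():
        if n < min_count:
            continue
        if a in present and b not in present:
            candidates.add(b)
        elif b in present and a not in present:
            candidates.add(a)
    return candidates

# Toy example: "beach" frequently co-occurs with "seagull sound" in training
# data, so for a beach clip with no seagull we probe for "seagull sound".
annotations = [{"beach", "seagull sound", "waves"}] * 60 + [{"beach", "waves"}] * 10
counts = cooccurrence_counts(annotations)
print(spurious_probe_targets({"beach", "waves"}, counts))  # {'seagull sound'}
```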

Benchmark and Contributions

The CMM benchmark organizes these vulnerabilities under the two contributors above and scores models with two metrics: Perception Accuracy (PA), the ability to recognize objects/events that are actually present, and Hallucination Resistance (HR), the ability to reject objects/events that are absent. Across the various probing scenarios, LMMs are assessed on how accurately they perceive real content and how reliably they resist asserting nonexistent content.
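
Given per-item results, the two metrics reduce to accuracies on the two probe polarities. The sketch below assumes PA is measured on existence probes (ground truth "yes") and HR on non-existence probes (ground truth "no"), consistent with how the summary describes them; the paper's exact aggregation may differ.

```python
def pa_and_hr(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Compute Perception Accuracy and Hallucination Resistance.

    Each result is (target_exists, answered_yes):
      - PA: fraction of existence probes ("yes" ground truth) answered "yes".
      - HR: fraction of non-existence probes ("no" ground truth) answered "no".
    """
    exist = [yes for exists, yes in results if exists]
    absent = [yes for exists, yes in results if not exists]
    pa = sum(exist) / len(exist) if exist else 0.0
    hr = sum(1 for yes in absent if not yes) / len(absent) if absent else 0.0
    return pa, hr

# Toy run: a model that answers "yes" to everything gets perfect PA but
# zero HR, exposing a hallucination-prone model that never says "no".
results = [(True, True)] * 10 + [(False, True)] * 10
print(pa_and_hr(results))  # (1.0, 0.0)
```

Reporting the two numbers side by side is what makes the benchmark diagnostic: a model biased toward affirmation inflates PA while collapsing HR.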

Through evaluation of multiple state-of-the-art LMMs, the paper highlights crucial insights and limitations, emphasizing the persistent challenge of balancing modality integration. Notably, visual dominance and language biases are pinpointed as areas demanding targeted mitigation. The authors suggest future work on balanced cross-modal learning, refined modality-fusion techniques, and mitigation of linguistic priors inherited from pretraining.

Implications and Future Directions

The insights derived from this paper carry significant implications for both theoretical advances and practical applications of LMMs. They underscore the need for robust multimodal learning frameworks capable of nuanced and accurate cross-modality interactions. By offering a detailed diagnostic approach, the benchmark provides a structured pathway for future research aimed at improving the reliability of LMMs in dynamic, real-world multimodal environments. Promising directions include building datasets with balanced multimodal representations, strengthening cross-modal fusion mechanisms, and making models more sensitive to cross-modal biases.

In summary, this paper constitutes an important step towards recognizing and addressing the challenges posed by hallucinations in LMMs, thereby contributing to the enhancement of multimodal AI's interpretative and interactive capabilities.

Authors (10)
  1. Sicong Leng (15 papers)
  2. Yun Xing (14 papers)
  3. Zesen Cheng (24 papers)
  4. Yang Zhou (311 papers)
  5. Hang Zhang (164 papers)
  6. Xin Li (980 papers)
  7. Deli Zhao (66 papers)
  8. Shijian Lu (151 papers)
  9. Chunyan Miao (145 papers)
  10. Lidong Bing (144 papers)