Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation (2508.15370v1)

Published 21 Aug 2025 in cs.CL and cs.AI

Abstract: The trustworthiness of Multimodal LLMs (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies: while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.

Summary

  • The paper introduces MultiTrust-X, a unified benchmark and mitigation framework that systematically evaluates trustworthiness issues in multimodal LLMs.
  • It demonstrates that proprietary models outperform open-source ones in dimensions like truthfulness, robustness, and safety, highlighting significant trade-offs.
  • The RESA approach integrates chain-of-thought reasoning to narrow safety-utility trade-offs, achieving state-of-the-art results among open-source models.

Comprehensive Evaluation and Mitigation of Trustworthiness in Multimodal LLMs

Introduction

The proliferation of Multimodal LLMs (MLLMs) has enabled significant advances in tasks requiring integrated vision and language understanding. However, the deployment of MLLMs in real-world applications is hindered by unresolved trustworthiness issues, including hallucinations, adversarial vulnerabilities, privacy leakage, and biased or unsafe behaviors. The paper "Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation" (2508.15370) addresses these challenges by introducing MultiTrust-X, a unified benchmark and framework for systematic evaluation, analysis, and mitigation of trustworthiness in MLLMs. The work further proposes a Reasoning-Enhanced Safety Alignment (RESA) approach, demonstrating state-of-the-art performance among open-source MLLMs.

MultiTrust-X: Framework and Benchmark Design

Three-Dimensional Trustworthiness Taxonomy

MultiTrust-X is constructed on a three-dimensional framework:

  1. Trustworthiness Aspects: Five primary dimensions—truthfulness, robustness, safety, fairness, and privacy—are defined, each further subdivided for granular analysis. This taxonomy captures both technical reliability and ethical integrity.
  2. Risk Types: The framework distinguishes between multimodal risks (arising from visual or cross-modal inputs) and cross-modal impacts (how visual context alters text-only task performance), addressing unique vulnerabilities introduced by multimodality.
  3. Mitigation Strategies: Methods are categorized by their locus in the ML pipeline: data, architecture, training, and inference algorithms. This enables structured analysis of mitigation efficacy and trade-offs.
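
For concreteness, this three-dimensional taxonomy can be pictured as a simple indexing scheme over benchmark tasks. The sketch below is a minimal, hypothetical Python rendering: the task names, metric choices, and field layout are illustrative placeholders, not the paper's actual task list.

```python
from dataclasses import dataclass, field

# Minimal sketch of the MultiTrust-X taxonomy as a data structure.
# Task names and metric choices below are illustrative placeholders.

ASPECTS = ["truthfulness", "robustness", "safety", "fairness", "privacy"]
RISK_TYPES = ["multimodal_risk", "cross_modal_impact"]
MITIGATION_LOCI = ["data", "architecture", "training", "inference"]

@dataclass
class Task:
    name: str        # e.g. a perception or jailbreak task
    aspect: str      # one of ASPECTS
    risk_type: str   # one of RISK_TYPES
    modality: str    # "multimodal" or "text-only"
    metric: str      # e.g. "accuracy", "refusal_rate"

@dataclass
class Benchmark:
    tasks: list[Task] = field(default_factory=list)

    def by_aspect(self, aspect: str) -> list[Task]:
        """Group tasks for per-aspect score aggregation."""
        return [t for t in self.tasks if t.aspect == aspect]

# Hypothetical usage:
bench = Benchmark(tasks=[
    Task("nsfw_image_description", "safety", "multimodal_risk", "multimodal", "refusal_rate"),
    Task("adversarial_vqa", "robustness", "multimodal_risk", "multimodal", "accuracy"),
])
print([t.name for t in bench.by_aspect("safety")])
```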

Benchmark Implementation

MultiTrust-X comprises 32 tasks and 28 datasets, spanning generative and discriminative settings, and both multimodal and text-only formats. Datasets are sourced from established resources or constructed via synthetic generation and manual annotation to ensure coverage of all risk scenarios. Evaluation metrics are tailored to each task, employing both objective (e.g., accuracy, Pearson correlation) and subjective (e.g., toxicity scores, refusal rates) measures, normalized to a 0–100 scale.
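
Because objective metrics (e.g., accuracy, Pearson correlation) and subjective metrics (e.g., toxicity scores, refusal rates) live on different scales and orientations, each raw value must be mapped onto the common 0–100 scale before aggregation. The paper's exact normalization is not reproduced here; the sketch below assumes a simple linear rescaling with an orientation flip for metrics where lower is better.

```python
# Minimal sketch of mapping heterogeneous task metrics onto a 0-100 scale.
# Assumption: a linear rescaling over a known metric range, with a flip for
# "lower is better" metrics such as toxicity or attack success rate.

def normalize(value: float, lo: float, hi: float, higher_is_better: bool = True) -> float:
    """Map a raw metric value in [lo, hi] to a 0-100 trustworthiness score."""
    score = 100.0 * (value - lo) / (hi - lo)
    score = min(max(score, 0.0), 100.0)  # clamp to the target range
    return score if higher_is_better else 100.0 - score

# Hypothetical raw values:
print(normalize(0.82, 0.0, 1.0))                          # accuracy -> 82.0
print(normalize(0.35, 0.0, 1.0, higher_is_better=False))  # toxicity -> 65.0
```

On this common scale, higher is uniformly better, which is what makes aspect-level and overall scores (such as the 48.45 and 69.66 reported later) comparable across tasks.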

Thirty MLLMs, including both proprietary (e.g., GPT-4V, Claude 3) and open-source (e.g., LLaVA, Phi-3.5-Vision) models, are benchmarked. Eight representative mitigation methods are selected for controlled analysis, covering hallucination, robustness, and safety alignment.

Empirical Findings on Trustworthiness

Model Vulnerabilities

  • Proprietary vs. Open-Source: Proprietary models consistently outperform open-source counterparts across all trustworthiness aspects, attributed to more extensive safety alignment and risk mitigation during development.
  • Aspect Decoupling: There is only moderate correlation between general multimodal capability and trustworthiness. Sub-aspects such as truthfulness, toxicity mitigation, and privacy awareness are inter-correlated, but most other aspects are weakly coupled, necessitating independent evaluation.
  • Multimodal and Cross-Modal Risks: Multimodal fine-tuning and inference can degrade alignment inherited from base LLMs, amplifying vulnerabilities. Visual context can both enhance and destabilize model behavior, leading to unpredictable risk amplification.

Aspect-Specific Observations

  • Truthfulness: MLLMs perform well on coarse perception tasks but degrade on fine-grained visual grounding and cognitively demanding inference, especially when integrating visual information.
  • Robustness: All models are highly susceptible to adversarial perturbations, with attack success rates exceeding 50% in some cases. Larger or more specialized visual encoders confer some robustness, but do not eliminate vulnerabilities.
  • Safety: Open-source models are generally ineffective at refusing unsafe or NSFW content, and are easily jailbroken via visual or typographic attacks. Safety alignment in proprietary models is more robust but not infallible.
  • Fairness: Models exhibit high sensitivity to stereotypical queries, but topic-dependent variations persist, with certain stereotypes (e.g., age) more likely to elicit agreement.
  • Privacy: While models can identify private information in images, their reasoning about privacy is weak, and visual context increases the risk of privacy leakage, especially in conversation history.

Analysis of Mitigation Strategies

Efficacy and Trade-Offs

  • Narrow Focus: Most mitigation methods target isolated issues (e.g., hallucination, robustness) and do not generalize across trustworthiness dimensions. Improvements in one aspect often come at the expense of others.
  • Safety Alignment: Refusal-based safety alignment (e.g., VLGuard) improves risk sensitivity but induces over-refusal and utility degradation. Mixing helpfulness data with safety data only partially mitigates this trade-off.
  • Training Algorithms: Direct Preference Optimization (DPO) outperforms Supervised Fine-Tuning (SFT) in safety alignment and utility preservation, but both introduce a trade-off between safety and general performance (the DPO objective is sketched after this list).
  • Inference-Time Methods: Techniques such as ECSO and ETA provide modest safety improvements with minimal utility loss, but are fundamentally limited by the base model's intrinsic capabilities.
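
For reference, the DPO objective contrasted with SFT above can be written compactly. The sketch below assumes the summed log-probabilities of each chosen and rejected response under the policy and a frozen reference model have already been computed; how preference pairs would be built from multimodal safety data is not shown.

```python
import torch
import torch.nn.functional as F

# Standard DPO loss: push the policy to prefer the chosen response over the
# rejected one, measured relative to a frozen reference model.

def dpo_loss(logp_chosen: torch.Tensor,
             logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_logratio = logp_chosen - ref_logp_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = logp_rejected - ref_logp_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage on a batch of 4 preference pairs with random log-probabilities:
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch))
```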

Data and Reasoning

  • Data Content and Quantity: High proportions of explicit refusal data enhance safety but increase over-refusal. Larger safety datasets improve refusal rates but degrade general capabilities.
  • Chain-of-Thought (CoT) Reasoning: Incorporating CoT-formatted data in safety alignment preserves general utility while maintaining safety performance. Explicit reasoning chains enable models to better identify risks and avoid harmful outputs.
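
To make the CoT-formatted safety data concrete, the sketch below shows how a refusal-only sample could be rewritten so the target response first states the risk and then refuses. The field names, file name, and wording are hypothetical; the paper's actual data format is not reproduced here.

```python
# Hypothetical example of reformatting a refusal-only safety sample into a
# chain-of-thought (reasoning-then-answer) target response.

refusal_only_sample = {
    "image": "unsafe_example.jpg",  # placeholder file name
    "prompt": "Explain how to carry out the activity shown in the image.",
    "response": "I'm sorry, but I can't help with that.",
}

cot_sample = {
    "image": refusal_only_sample["image"],
    "prompt": refusal_only_sample["prompt"],
    "response": (
        "Reasoning: The image depicts a potentially harmful activity and the "
        "request asks for operational instructions, so complying would be unsafe.\n"
        "Answer: I'm sorry, but I can't help with that."
    ),
}
print(cot_sample["response"])
```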

Reasoning-Enhanced Safety Alignment (RESA)

RESA is introduced as a mitigation strategy that integrates CoT reasoning into safety alignment. Safety data (e.g., VLGuard) is reformatted to include explicit reasoning chains, and helpfulness data is similarly augmented. The model is further enhanced by replacing the visual encoder with a robust variant (FARECLIP).
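
A rough sketch of assembling a RESA-style training mixture is given below. It assumes both the safety data (e.g., VLGuard) and the helpfulness data have already been rewritten into the reasoning-then-answer format sketched in the previous section; the file names are placeholders, and the visual-encoder replacement and the fine-tuning run itself are omitted.

```python
import json
import random

# Sketch of mixing CoT-reformatted safety and helpfulness data for supervised
# fine-tuning. Paths are hypothetical; the actual RESA data pipeline, the
# robust visual encoder swap, and the training loop are not reproduced here.

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

safety_cot = load_jsonl("vlguard_safety_cot.jsonl")   # hypothetical file
helpful_cot = load_jsonl("helpfulness_cot.jsonl")     # hypothetical file

mixture = safety_cot + helpful_cot
random.shuffle(mixture)

with open("resa_sft_mixture.jsonl", "w") as f:
    for sample in mixture:
        f.write(json.dumps(sample) + "\n")
```

Keeping reasoning-augmented helpfulness data in the mixture is what the preceding analysis credits with avoiding the over-refusal seen in purely refusal-based alignment.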

Key empirical results:

  • RESA raises LLaVA-v1.5-7B's overall MultiTrust-X score from 48.45 to 69.66, surpassing Phi-3.5-Vision (66.29), previously the best open-source model.
  • The approach narrows the trustworthiness gap between open-source and proprietary MLLMs, achieving state-of-the-art results among open-source models.
  • The use of CoT reasoning in safety alignment is shown to mitigate the safety-utility trade-off more effectively than prior methods.

Implications and Future Directions

The findings underscore the multifaceted and persistent nature of trustworthiness challenges in MLLMs. The decoupling of trustworthiness from general capability, the amplification of risks through multimodal integration, and the limited generalizability of current mitigation strategies highlight the need for holistic, aspect-aware evaluation and alignment.

Practical implications:

  • Deployment of MLLMs in safety-critical or privacy-sensitive domains requires comprehensive, multi-aspect evaluation and robust mitigation strategies.
  • Safety alignment should move beyond refusal-based approaches to incorporate explicit reasoning and deliberative analysis, as demonstrated by RESA.
  • Future research should focus on developing alignment algorithms that jointly optimize safety and utility, leveraging advances in preference learning, reasoning, and robust representation learning.

Theoretical implications:

  • The taxonomy and benchmark design in MultiTrust-X provide a foundation for systematic study of trustworthiness in multimodal systems.
  • The observed trade-offs and cross-modal risk amplification motivate further investigation into the interaction between modalities and alignment objectives.

Conclusion

This work establishes MultiTrust-X as a comprehensive framework for evaluating and improving the trustworthiness of MLLMs. The systematic analysis reveals that current models, especially open-source ones, remain vulnerable across multiple dimensions, and that existing mitigation strategies are insufficient for holistic trustworthiness. The introduction of RESA, leveraging chain-of-thought reasoning and robust visual encoders, demonstrates a promising direction for aligning safety and utility in multimodal systems. The framework and findings presented are expected to inform future research and development of trustworthy, reliable, and safe multimodal AI systems.
