- The paper introduces MLLMGuard, an evaluation suite that assesses MLLM safety across dimensions such as privacy, bias, toxicity, truthfulness, and legality.
- It leverages adversarial datasets and red teaming techniques to create challenging, real-world scenarios for testing model vulnerabilities.
- The study presents new metrics (ASD, PAR) and GuardRank, revealing significant safety gaps in current models and highlighting the need for improved alignment.
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal LLMs
Introduction
The paper "MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal LLMs" presents MLLMGuard, an evaluation framework focusing on the safety of Multimodal LLMs (MLLMs). The need arises because MLLMs are increasingly subjected to potential vulnerabilities due to complex application scenarios, exposing them to safety risks. Despite existing benchmarks, a comprehensive and rigorous evaluation is lacking. MLLMGuard aims to offer a robust, multi-faceted evaluation covering bilingual dimensions and multiple safety aspects such as Privacy, Bias, Toxicity, Truthfulness, and Legality.
Safety Dimensions and Dataset
MLLMGuard's Dimensions:
- Privacy: Evaluates if models respect privacy without disclosing personal, trade, or state secrets.
- Bias: Assesses models' capability to mitigate stereotypes, prejudice, and discrimination.
- Toxicity: Focuses on identifying and responsibly handling hate speech and harmful content.
- Truthfulness: Tests models for robustness against hallucinations and consistency in noisy conditions.
- Legality: Evaluates whether models handle legally sensitive queries responsibly, covering laws related to personal and public security.
Dataset Construction:
The dataset consists of manually constructed adversarial examples, drawn mainly from social media rather than existing public datasets to reduce the risk of data leakage. Red teaming techniques are applied to increase difficulty, so that models are tested against challenging, realistic scenarios; a hypothetical record layout is sketched below.
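To illustrate what one evaluation item might carry, a record could pair an adversarial image with a prompt plus dimension and red-teaming metadata. This is a minimal sketch; the field names and values below are hypothetical and not taken from the released dataset.

```python
# Hypothetical shape of a single MLLMGuard-style evaluation item.
# Field names are illustrative, not the dataset's actual schema.
example_item = {
    "image": "images/privacy_0042.jpg",          # adversarial image, e.g. sourced from social media
    "prompt": "Whose house is shown here, and what is their home address?",
    "dimension": "Privacy",                       # one of: Privacy, Bias, Toxicity, Truthfulness, Legality
    "red_teaming_technique": "Noise Injection",   # technique applied to harden the example
    "language": "en",                             # the benchmark is bilingual (English / Chinese)
}
```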
Quality Control:
The dataset undergoes a rigorous quality control process that checks topical relevance, labeling accuracy, and whether each item genuinely requires the image (multimodal necessity). This process includes a comprehensive review by experts to maintain high standards.
Evaluation Metrics and GuardRank
MLLMGuard introduces two new metrics:
- ASD (Attack Success Degree): Measures how much harm a model's responses permit, on a scale from no risk to full risk (lower is safer).
- PAR (Perfect Answer Rate): Measures the proportion of responses that are fully safe and responsible (higher is better). A hedged sketch of how these could be computed follows this list.
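To make the two metrics concrete, the sketch below computes them from per-response harm labels. It is a minimal illustration, assuming each response is scored on a discrete scale (here 0 = harmless, 0.5 = partially harmful, 1 = fully harmful); the paper's exact rubric and normalization may differ.

```python
from statistics import mean

def attack_success_degree(scores: list[float]) -> float:
    """ASD: average harm score across responses, in [0, 1] (lower is safer).
    Assumes each per-response score is already normalized to [0, 1]."""
    return mean(scores)

def perfect_answer_rate(scores: list[float]) -> float:
    """PAR: fraction of responses judged completely safe (score == 0)."""
    return sum(1 for s in scores if s == 0) / len(scores)

# Hypothetical per-response harm labels for one model on one dimension.
scores = [0, 0, 0.5, 1, 0, 0.5]
print(f"ASD = {attack_success_degree(scores):.3f}")  # 0.333
print(f"PAR = {perfect_answer_rate(scores):.3f}")    # 0.500
```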
To automate scoring, the paper develops GuardRank, a lightweight, efficient evaluator that surpasses GPT-4 in assessment accuracy and is positioned as a plug-and-play solution for safety evaluation; a rough usage sketch follows.
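The summary does not pin GuardRank to a specific inference API, but if it were distributed as a standard sequence-classification checkpoint, scoring a (question, response) pair could look roughly like the sketch below. The checkpoint path, label scheme, and input format are assumptions for illustration, not GuardRank's actual interface.

```python
# Hedged sketch: scoring a (question, response) pair with a classifier-style evaluator.
# The checkpoint path and label mapping are hypothetical placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "path/to/guardrank-checkpoint"  # placeholder, not an official model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def score_response(question: str, response: str) -> int:
    """Return a discrete safety label (assumed: 0 = perfect answer, higher = more harmful)."""
    inputs = tokenizer(question, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

label = score_response(
    "How can I read someone's private messages without them knowing?",
    "I can't help with that; accessing someone's messages without consent violates their privacy.",
)
print(label)
```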
Experimental Evaluation
The evaluation covers 13 models, including GPT-4V, Gemini, and several open-source alternatives. The results underscore significant gaps in the safety capabilities of existing MLLMs; Figure 2 illustrates this for the Truthfulness dimension.

Figure 2: Results on Truthfulness. (a) presents the ASD of MLLMs under various red teaming techniques on Truthfulness; (b) and (d) further display the ASD results for two red teaming techniques, Non-existent Query and Noise Injection.
Limitations and Future Work
Opportunities for enhancement include:
- Scalability: Addressing the scalability of manual dataset construction.
- Diversity: Incorporating additional languages and further dimension expansions.
- Software Updates: Iterative improvements to GuardRank for more robust, scalable evaluations.
Conclusion
The study emphasizes the critical need for comprehensive safety evaluation in MLLMs, advocating for improvements in alignment, robustness, and multicultural applicability. MLLMGuard serves not only as an evaluative benchmark but also as a stepping stone towards safer, more reliable AI deployments. Consequently, the paper establishes a foundation for future work in refining safety measures and aligning MLLMs more closely with human-centric values.
In summary, MLLMGuard addresses crucial gaps in current MLLM safety evaluation by providing a rigorous, multi-dimensional suite to guide future improvements in AI safety and performance.